Re: Best strategy migrate indexes

2022-11-07 Thread Trejkaz
The process itself sounds like it should work (it's basically a
reindex, so it should be safer than trying to migrate directly).

I would check that the Luke version matches the Lucene version - that
is, the precise version of Lucene each of them is using. If the two
match, it shouldn't be possible to get errors like this.

TX


On Mon, 7 Nov 2022 at 22:09, Pablo Vázquez Blázquez  wrote:
>
> Hi!
>
> > I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.
>
> I am doing the following steps:
>
> - create IndexReader with lucene5 package over a lucene5 index
> - create IndexWriter with lucene7 package
> - iterate over reader.numDocs() to process each Document (lucene5)
> - convert each Document (lucene5) to lucene7 Document
> - for each IndexableField (lucene5) from Document (lucene5) convert
> it to create an IndexableField (lucene7)
> - create a SortedDocValuesField (lucene7) and add it to the
> Document (lucene7)
> - add the field to the Document (lucene7)
> - add each converted Document to the writer
> - close  IndexReader and IndexWriter
>
> When I open the resulting migrated lucene7 index with Luke, I get an error:
> org.apache.lucene.index.IndexFormatTooNewException: Format version is not
> supported (resource
> BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
> 9 (needs to be between 6 and 7)
>
> When I use the tool "luceneupgrader
> <https://github.com/hakanai/luceneupgrader>", I got:
> java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info
> tests_small_index-7.x-migrator
> Lucene index version: 7
>
> What am I doing wrong or misunderstanding?
>
> Thanks!
>
> On Wed, 2 Nov 2022 at 21:13, Pablo Vázquez Blázquez () wrote:
>
> > Hi,
> >
> > Luckily we were already using lucenemigrator
> >
> >
> > What do you mean by "lucenemigrator"? Is it a public tool?
> >
> > I am trying to create a tool to read docs from a lucene5 index and
> > generate lucene9 documents from them (with docValues). That might work,
> > right? I am shading both lucene5 and lucene9 to avoid package conflicts.
> >
> > Thanks!
> >
> > On Tue, 1 Nov 2022 at 0:35, Trejkaz () wrote:
> >
> >> Well...
> >>
> >> There's a way, but I wouldn't necessarily recommend it.
> >>
> >> You can write custom migration code against some version of Lucene
> >> which supports doc values, to create doc values fields. It's going to
> >> involve writing a FilterCodecReader which wraps your real index and
> >> then pretends to also have doc values, which you'll build in a custom
> >> class which works similarly to UninvertingReader. Then you pass those
> >> CodecReaders to IndexWriter.addIndexes to create a new index which
> >> really has those doc values.
> >>
> >> We did that ourselves when we had the same issue. The only painful
> >> thing about it is having to keep around older versions of lucene to do
> >> that migration. Forever. Luckily we were already using lucenemigrator,
> >> which has the older versions baked into it with package prefixes. So
> >> that library will get fatter and fatter over time but at least our own
> >> code only gets fatter at the rate migrations are added.
> >>
> >> The same approach works for any other kind of ad-hoc migration you
> >> might want to perform. e.g., you might want to create points. Or
> >> remove an index for a field. Or add an index for a field.
> >>
> >> TX
> >>
> >>
> >> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez 
> >> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Thank you all for your responses.
> >> >
> >> > So, when updating to a newer (major) Lucene version that modifies its
> >> > codecs, there is no way to ensure everything keeps working properly,
> >> unless
> >> > re-indexing, right?
> >> >
> >> > Apart from not having some original sources that were indexed (which I
> >> will
> >> > try to solve by using the *IndexUpgrader *tool), I have another
> >> problem: I
> >> > was using the org.apache.lucene.uninverting.UninvertingReader to perform
> >> > queries against the index, mainly using the grouping api. But
> >> currently, it
> >> > was removed (since 

Re: Best strategy migrate indexes

2022-11-02 Thread Trejkaz
Was a typo, meant to say luceneupgrader.

And by itself, it won't do any kind of work to convert fields between
different types.
For that, you have to do what I described.

TX

On Thu, 3 Nov 2022 at 07:14, Pablo Vázquez Blázquez  wrote:
>
> Hi,
>
> Luckily we were already using lucenemigrator
>
>
> What do you mean by "lucenemigrator"? Is it a public tool?
>
> I am trying to create a tool to read docs from a lucene5 index and generate
> lucene9 documents from them (with docValues). That might work, right? I am
> shading both lucene5 and lucene9 to avoid package conflicts.
>
> Thanks!
>
> On Tue, 1 Nov 2022 at 0:35, Trejkaz () wrote:
>
> > Well...
> >
> > There's a way, but I wouldn't necessarily recommend it.
> >
> > You can write custom migration code against some version of Lucene
> > which supports doc values, to create doc values fields. It's going to
> > involve writing a FilterCodecReader which wraps your real index and
> > then pretends to also have doc values, which you'll build in a custom
> > class which works similarly to UninvertingReader. Then you pass those
> > CodecReaders to IndexWriter.addIndexes to create a new index which
> > really has those doc values.
> >
> > We did that ourselves when we had the same issue. The only painful
> > thing about it is having to keep around older versions of lucene to do
> > that migration. Forever. Luckily we were already using lucenemigrator,
> > which has the older versions baked into it with package prefixes. So
> > that library will get fatter and fatter over time but at least our own
> > code only gets fatter at the rate migrations are added.
> >
> > The same approach works for any other kind of ad-hoc migration you
> > might want to perform. e.g., you might want to create points. Or
> > remove an index for a field. Or add an index for a field.
> >
> > TX
> >
> >
> > On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez 
> > wrote:
> > >
> > > Hi all,
> > >
> > > Thank you all for your responses.
> > >
> > > So, when updating to a newer (major) Lucene version that modifies its
> > > codecs, there is no way to ensure everything keeps working properly,
> > unless
> > > re-indexing, right?
> > >
> > > Apart from not having some original sources that were indexed (which I
> > will
> > > try to solve by using the *IndexUpgrader *tool), I have another problem:
> > I
> > > was using the org.apache.lucene.uninverting.UninvertingReader to perform
> > > queries against the index, mainly using the grouping api. But currently,
> > it
> > > was removed (since Lucene 7.0). So, again, do I have any other
> > alternative,
> > > apart from re-indexing to use docValues?
> > >
> > > To give you more context, I am a developer of a tool that multiple
> > > customers can use to index their data (currently, with Lucene 5.5.5). We
> > > are planning to upgrade to Lucene 9 (because of some vulnerabilities
> > > affecting Lucene 5.5.5) and I think asking them to reindex will not go
> > down
> > > well :(
> > >
> > > Regards,
> > >
> > > On Sat, 29 Oct 2022 at 23:31, Matt Davis () wrote:
> > >
> > > > Inside of Zulia search engine, the object being indexed is always a
> > > > JSON/BSON object and we store the BSON as a stored byte field in the
> > > > index.  This allows easy internal reindexing when the searchable fields
> > > > change but also allows us to update to the latest lucene version.
> > > >  Combined with using lucene-backward-codecs an older index than the
> > current
> > > > major version can be opened and reindexed.  If you have stored all the
> > > > fields (or a json/bson) in the index, it would be easy to reindex in
> > the
> > > > new format.  If you have not, maybe opening with lucene-backward-codecs
> > > > will be enough for your use case.
> > > >
> > > > Thanks,
> > > > Matt
> > > >
> > > > On Sat, Oct 29, 2022 at 2:30 PM Baris Kazar 
> > > > wrote:
> > > >
> > > > > It is always good practice to retain the original, non-indexed
> > > > > data, since whenever Lucene changes version,
> > > > > even a minor version, I always reindex.
> > > > >
> > > > > Best regards
> > > > > 
> > > > > From: Gus Heck 

Re: Best strategy migrate indexes

2022-10-31 Thread Trejkaz
Well...

There's a way, but I wouldn't necessarily recommend it.

You can write custom migration code against some version of Lucene
which supports doc values, to create doc values fields. It's going to
involve writing a FilterCodecReader which wraps your real index and
then pretends to also have doc values, which you'll build in a custom
class which works similarly to UninvertingReader. Then you pass those
CodecReaders to IndexWriter.addIndexes to create a new index which
really has those doc values.
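
Roughly, the shape of that migration is sketched below (illustrative only, written
against a recent Lucene API; "AddedDocValuesProducer" is a hypothetical class that
would synthesise the doc values much as UninvertingReader does, and depending on the
fields involved you may also need to override getFieldInfos() so the new segments
report a doc values type):

    import org.apache.lucene.codecs.DocValuesProducer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    final class DocValuesAddingReader extends FilterCodecReader {
        DocValuesAddingReader(CodecReader in) {
            super(in);
        }

        @Override
        public DocValuesProducer getDocValuesReader() {
            // Hypothetical producer that builds the doc values on the fly,
            // UninvertingReader-style, from the wrapped reader.
            return new AddedDocValuesProducer(in);
        }

        @Override
        public CacheHelper getCoreCacheHelper() { return null; }

        @Override
        public CacheHelper getReaderCacheHelper() { return null; }

        // Rewrites the old index into a new directory whose segments really
        // contain the doc values.
        static void migrate(Path oldPath, Path newPath) throws IOException {
            try (Directory oldDir = FSDirectory.open(oldPath);
                 Directory newDir = FSDirectory.open(newPath);
                 DirectoryReader reader = DirectoryReader.open(oldDir);
                 IndexWriter writer = new IndexWriter(newDir, new IndexWriterConfig())) {
                List<CodecReader> wrapped = new ArrayList<>();
                for (LeafReaderContext leaf : reader.leaves()) {
                    wrapped.add(new DocValuesAddingReader(
                            SlowCodecReaderWrapper.wrap(leaf.reader())));
                }
                writer.addIndexes(wrapped.toArray(new CodecReader[0]));
                writer.commit();
            }
        }
    }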

We did that ourselves when we had the same issue. The only painful
thing about it is having to keep around older versions of lucene to do
that migration. Forever. Luckily we were already using lucenemigrator,
which has the older versions baked into it with package prefixes. So
that library will get fatter and fatter over time but at least our own
code only gets fatter at the rate migrations are added.

The same approach works for any other kind of ad-hoc migration you
might want to perform. e.g., you might want to create points. Or
remove an index for a field. Or add an index for a field.

TX


On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez  wrote:
>
> Hi all,
>
> Thank you all for your responses.
>
> So, when updating to a newer (major) Lucene version that modifies its
> codecs, there is no way to ensure everything keeps working properly, unless
> re-indexing, right?
>
> Apart from not having some original sources that were indexed (which I will
> try to solve by using the *IndexUpgrader *tool), I have another problem: I
> was using the org.apache.lucene.uninverting.UninvertingReader to perform
> queries against the index, mainly using the grouping api. But currently, it
> was removed (since Lucene 7.0). So, again, do I have any other alternative,
> apart from re-indexing to use docValues?
>
> To give you more context, I am a developer of a tool that multiple
> customers can use to index their data (currently, with Lucene 5.5.5). We
> are planning to upgrade to Lucene 9 (because of some vulnerabilities
> affecting Lucene 5.5.5) and I think asking them to reindex will not go down
> well :(
>
> Regards,
>
> On Sat, 29 Oct 2022 at 23:31, Matt Davis () wrote:
>
> > Inside of Zulia search engine, the object being indexed is always a
> > JSON/BSON object and we store the BSON as a stored byte field in the
> > index.  This allows easy internal reindexing when the searchable fields
> > change but also allows us to update to the latest lucene version.
> >  Combined with using lucene-backward-codecs an older index than the current
> > major version can be opened and reindexed.  If you have stored all the
> > fields (or a json/bson) in the index, it would be easy to reindex in the
> > new format.  If you have not, maybe opening with lucene-backward-codecs
> > will be enough for your use case.
> >
> > Thanks,
> > Matt
> >
> > On Sat, Oct 29, 2022 at 2:30 PM Baris Kazar 
> > wrote:
> >
> > > It is always good practice to retain the original, non-indexed
> > > data, since whenever Lucene changes version,
> > > even a minor version, I always reindex.
> > >
> > > Best regards
> > > 
> > > From: Gus Heck 
> > > Sent: Saturday, October 29, 2022 2:17 PM
> > > To: java-user@lucene.apache.org 
> > > Subject: Re: Best strategy migrate indexes
> > >
> > > Hi Pablo,
> > >
> > > The deafening silence is probably nobody wanting to give you the bad
> > news.
> > > You are on a mission that may not be feasible, and even if you can get it
> > > to "work", the end result won't likely be equivalent to indexing the
> > > original data with Lucene 9.x. The indexing process is fundamentally
> > lossy
> > > and information originally used to produce non-stored fields will have
> > been
> > > thrown out. A simple example is things like stopwords or anything
> > analyzed
> > > with subclasses of FilteringTokenFilter. If the stop word list changed,
> > or
> > > the details of one of these filters changed (bugfix?), you will end up
> > with
> > > a different result than indexing with 9.x. This is just one
> > > example, another would be stemming where the index likely only contains
> > the
> > > stem, not the whole word. Other folks who are more interested in the
> > > details of our codecs than I am can probably provide further examples on
> > a
> > > more fundamental level. Lucene is not a database, and the source
> > documents
> > > should always be retained in a form that can be reindexed. If you have
> > > inherited a system where source material has not been retained, you have
> > a
> > > difficult project and may have some potentially painful expectation
> > setting
> > > to perform.
> > >
> > > Best,
> > > Gus
> > >
> > >
> > >
> > > On Fri, Oct 28, 2022 at 8:01 AM Pablo Vázquez Blázquez <
> > pabl...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have some indices indexed with lucene 5.5.0. I have updated my
> > > > dependencies and code to Lucene 7 (but my final goal is to use Lucene
> > 9)
> 

Re: TermPositions (Lucene 3.3) replacement?

2021-07-08 Thread Trejkaz
Hi.

What you probably want is `TermsEnum.postings(PostingsEnum reuse, int
flags)`, with `PostingsEnum.POSITIONS` in the flags.

I'd also recommend using the `TermsEnum` to iterate the terms instead
of using your own loop, as working with postings works better if you
do.
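
Something along these lines (a sketch only, against the current API; the field name
is whatever you indexed, and the field must have been indexed with positions):

    import org.apache.lucene.index.*;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;
    import java.io.IOException;

    final class PositionsWalker {
        static void walk(IndexReader reader, String field) throws IOException {
            Terms terms = MultiTerms.getTerms(reader, field);
            if (terms == null) {
                return; // field has no indexed terms
            }
            TermsEnum termsEnum = terms.iterator();
            PostingsEnum postings = null;
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                // Ask for positions explicitly; with PostingsEnum.NONE they are skipped.
                postings = termsEnum.postings(postings, PostingsEnum.POSITIONS);
                int doc;
                while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    int freq = postings.freq();
                    for (int i = 0; i < freq; i++) {
                        int position = postings.nextPosition();
                        // ... use term, doc and position here ...
                    }
                }
            }
        }
    }

If you only need a handful of known terms rather than all of them,
termsEnum.seekExact(new BytesRef(...)) replaces the outer loop.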

TX

On Fri, 9 Jul 2021 at 03:11, TK Solr  wrote:
>
> Hi,
> I'm porting a piece of code that uses TermPositions of Lucene 3.3 to 8.9.
> https://lucene.apache.org/core/4_0_0/MIGRATE.html reads:
>
>   *
>
> TermPositions is renamed to DocsAndPositionsEnum, and no longer extends 
> the
> docs only enumerator (DocsEnum).
>
> But DocsAndPositionsEnum doesn't exist in 8.9 either. There is no mention of
> removal of DocsAndPositionsEnum in any of MIGRATE.html files.
>
> Is there a recommended way to replace TermPositions? Here is the code I am
> trying to port to give you a context:
>
> private void calculateScores() throws IOException {
>     // initialize buffers
>     FixedBitSet docPointers = new FixedBitSet(reader.maxDoc());
>     List uniqueTerms = new LinkedList<>(new LinkedHashSet<>(terms));
>     uniqueTermSize = uniqueTerms.size();
>     this.roughThresholdFreq = (int) (uniqueTermSize * ROUGH_CUTOFF);
>     for (Iterator iter = uniqueTerms.iterator(); iter.hasNext();) {
>         try (TermPositions tp = reader.termPositions(iter.next())) {
>             while (tp.next()) {
>                 int f = scoredDocs.adjustOrPutValue(tp.doc(), 1, 1);
>                 if (f > roughThresholdFreq) {
>                     docPointers.fastSet(tp.doc());
>                 }
>             }
>         }
>     }
>     if (docPointers.cardinality() > 0) {
>         docPointerIterator = (FixedBitSetIterator) docPointers.iterator();
>     }
> }
>
>
> TK
>




Fwd: org.apache.lucene.index.DirectoryReader Javadocs

2020-12-10 Thread Trejkaz
> May I request to add more info into Lucene
> org.apache.lucene.index.DirectoryReader about the readOnly=true attribute and
>
> more info on readerAttributes parameters please?

Referring to the current documentation:
https://javadoc.io/doc/org.apache.lucene/lucene-core/latest/org/apache/lucene/index/DirectoryReader.html

I see no such readerAttributes to which more information should be added.

Perhaps you should provide a URL to the documentation you are talking
about, so that people might know what you're going on about.

TX




Re: Port on iOS

2020-08-21 Thread Trejkaz
This looks interesting.

https://github.com/lukhnos/LuceneSearchDemo-iOS

On Sat, 22 Aug 2020 at 00:12, Saad Umar  wrote:
>
> I want to run Lucene with iOS, how do I do that
>
> --
>
> Best,
>
> Saad Umar
>
> Senior Software Engineer
>
> *Avanza Solutions (Pvt.) Ltd.*
>
> Office # 14-B, Fakhri Trade Centre SR 6/10, Shahrah-e-Liaquat New Challi,
> Karachi-74200, Pakistan.
> M +92 334 7353864
>
> U  +92 21 111-AVANZA (282-692) (642)
>
> Esaad.u...@avanzasolutions.com 
>
> W  *www.avanzasolutions.com *




Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-30 Thread Trejkaz
On Mon, 27 Jul 2020 at 19:24, Adrien Grand  wrote:
>
> It's interesting you're not seeing the same slowdown on the other field.
> How hard would it be for you to test what the performance is if you
> lowercase the name of the digest algorithms, ie. "md5;[md5 value in hex]",
> etc. The reason I'm asking is because the compression logic is optimized
> for lowercase ASCII so removing uppercase letters would help remove the
> need to encode exceptions, which is one reason I'm thinking why the
> slowdown might be less on your other field.

It took me a while to get some free time to make a new version of the
test which doesn't have our own code in it so that I was able to add
the new field without rewriting a large chunk of our system... but it
looks like the timing for lowercase prefixes is around the same as
upper.

This particular test I've ended up doing though is a pathological
case, as it turned out to have 0 hits in the index despite searching
for 29 million digests.

-
Time for just reading the digest list
Count = 29459432, time = 1946 ms
Count = 29459432, time = 1752 ms
Count = 29459432, time = 1752 ms
-
Times for digest-upper
Count = 0, time = 40570 ms
Count = 0, time = 42574 ms
Count = 0, time = 40121 ms
-
Times for digest-lower
Count = 0, time = 40462 ms
Count = 0, time = 40319 ms
Count = 0, time = 39938 ms
-
Times for digest-no-prefix
Count = 0, time = 10936 ms
Count = 0, time = 10857 ms
Count = 0, time = 10628 ms
-

So about 4 times faster on the field with no term prefixes.
The code for all 3 tests is shared:

private static void timeDigest(Path md5sFile, IndexReader reader,
                               String field, String termPrefix) throws IOException {
    try (BufferedReader md5sReader = Files.newBufferedReader(md5sFile)) {
        TermsEnum termsEnum = MultiTerms.getTerms(reader, field).iterator();
        PostingsEnum postingsEnum = null;

        long t0 = System.currentTimeMillis();
        int hitCount = 0;

        while (true) {
            String md5 = md5sReader.readLine();
            if (md5 == null) {
                break;
            }

            if (termsEnum.seekExact(new BytesRef(termPrefix + md5))) {
                postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
                while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    hitCount++;
                }
            }
        }

        long t1 = System.currentTimeMillis();
        System.out.println("Count = " + hitCount + ", time = " + (t1 - t0) + " ms");
    }
}

> In case you're using an old JRE, you might want to try out with a JRE 13 or
> more recent. Some of the logic in this lowercase ASCII compression only
> gets vectorized on JDK13+.

Times for JDK 14.0.2:

-
Times for just reading the digest list
Count = 29459432, time = 2050 ms
Count = 29459432, time = 2156 ms
Count = 29459432, time = 1905 ms
-
Times for digest-upper
Count = 0, time = 24336 ms
Count = 0, time = 24236 ms
Count = 0, time = 23986 ms
-
Times for digest-lower
Count = 0, time = 24440 ms
Count = 0, time = 23960 ms
Count = 0, time = 23956 ms
-
Times for digest-no-prefix
Count = 0, time = 13177 ms
Count = 0, time = 13095 ms
Count = 0, time = 13081 ms
-


Almost a 2:1 speed boost for prefixed timings just by updating the JDK...

The non-prefixed timings seem to be 30% slower than on JDK 8 (WTF?)
but still win when compared to the prefixed timings alone.

TX




Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-27 Thread Trejkaz
Yep, the timings posted were the best speed out of 10 runs in a row.
The profiling was done in the middle of 1000 iterations in a row just
to knock off any warm-up time.

The sort of data we're storing in the field is quite possibly a
worst-case scenario for the compression. The data is mixed digest info
like

"MD5;[md5 value in hex]"
"SHA-1;[sha1 value in hex]"
"SHA-256;[sha256 value in hex]"

In fact, there's another field in the index which contains the same
MD5s without the common prefix - the same sort of operation on that
field doesn't get the same slowdown. (It's a bit slower. Like 5% or
so? Certainly nothing like 100%.) So at least for looking up MD5s we
have the luxury of an alternative option for the lookups. For other
digests I'm afraid we're stuck for now until we change how we index
those.

What's ironic is that we originally put the prefix on to make seeking
to the values faster. ^^;;

TX


On Mon, 27 Jul 2020 at 17:08, Adrien Grand  wrote:
>
> Alex, this issue you linked is about the terms dictionary of doc values.
> Trejkaz linked the correct issue which is about the terms dictionary of the
> inverted index.
>
> It's interesting you're seeing so much time spent in readVInt on 8.5 since
> there is a single vint that is read for each block in
> "LowercaseAsciiCompression.decompress". Are these relative timings
> consistent over multiple runs?
>
> On Mon, Jul 27, 2020 at 5:57 AM Alex K  wrote:
>
> > Hi,
> >
> > Also have a look here:
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378
> >
> > Seems it might be related.
> > - Alex
> >
> > On Sun, Jul 26, 2020, 23:31 Trejkaz  wrote:
> >
> > > Hi all.
> > >
> > > I've been tracking down slow seeking performance in TermsEnum after
> > > updating to Lucene 8.5.1.
> > >
> > > On 8.5.1:
> > >
> > > SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our
> > > code)
> > > SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%)
> > > CompressionAlgorithm$2.read: 25,789 ms (53.5%)
> > > LowercaseAsciiCompression.decompress: 25,789 ms (53.5%)
> > > DataInput.readVInt: 24,690 ms (51.2%)
> > > SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%)
> > >
> > > On 7.7.0 (previous version we were using):
> > >
> > > SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our
> > > code)
> > > SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%)
> > > BufferedIndexInput.readBytes: 1,500 ms (11.1%)
> > > DataInput.readVInt: 1,108 (8.2%)
> > > SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%)
> > >
> > > So on the surface it sort of looks like the new version spends less
> > > time scanning and much more time loading blocks to decompress?
> > >
> > > Looking for some clues to what might have changed here, and whether
> > > it's something we can avoid, but currently LUCENE-4702 looks like it
> > > may be related.
> > >
> > > TX
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
>
>
> --
> Adrien




Fwd: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-26 Thread Trejkaz
Hi all.

I've been tracking down slow seeking performance in TermsEnum after
updating to Lucene 8.5.1.

On 8.5.1:

SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our code)
SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%)
CompressionAlgorithm$2.read: 25,789 ms (53.5%)
LowercaseAsciiCompression.decompress: 25,789 ms (53.5%)
DataInput.readVInt: 24,690 ms (51.2%)
SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%)

On 7.7.0 (previous version we were using):

SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our code)
SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%)
BufferedIndexInput.readBytes: 1,500 ms (11.1%)
DataInput.readVInt: 1,108 (8.2%)
SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%)

So on the surface it sort of looks like the new version spends less
time scanning and much more time loading blocks to decompress?

Looking for some clues to what might have changed here, and whether
it's something we can avoid, but currently LUCENE-4702 looks like it
may be related.

TX




Re: CheckIndex complaining about -1 for norms value

2020-06-14 Thread Trejkaz
The answer here might be "terrifyingly old" actually. We've been using
IndexUpgrader quite heavily. My best guess is Lucene 2.x.

What I can verify at least, by manual inspection, is that the docs
where the value is showing -1 are also docs where there is no value in
the field, where I presume it was supposed to be zero instead. What's
funny though is that some values _are_ zero...

Anyway, I can probably try writing a tricky migration to rewrite those
to 0. Or deleting them entirely, because like I previously said, I
thought we weren't using norms at all. I'll have to track down the
truth on that because I don't know anymore. This index clearly has
them but it's also very old.

Only after migrating up to Lucene 8 did we get any complaints from
CheckIndex, though. So now I'm not sure whether I have broken this by
migrating to v8, or whether this is just a new check in v8 and the
index was already screwed. More investigation required. :(

TX


On Thu, 11 Jun 2020 at 16:00, Adrien Grand  wrote:
>
> To my knowledge, -1 always represented the maximum supported length, both
> before and after 7.0 (when we changed the norms encoding). One thing that
> changed when we introduced sparse norms is that documents with no value
> moved from having 0 as a norm to not having a norm at all, but I don't see
> how this could explain what you are seeing either.
>
> Do you know what is the Lucene version that initially indexed this document
> (and thus computed the norm value)?
>
> On Thu, Jun 11, 2020 at 8:45 AM Trejkaz  wrote:
>
> > Well,
> >
> > We're using the default Lucene similarity. But as far as I know, we've
> > always disabled norms as well. So I'm surprised I'm even seeing norms
> > mentioned in the context of our own index, which is why I wondered
> > whether -1 might have been an older placeholder for "no value" which
> > later became 0 or something.
> >
> > About the only thing I'm sure about at the moment is that whatever is
> > going on is weird.
> >
> > TX
> >
> > On Thu, 11 Jun 2020 at 15:38, Adrien Grand  wrote:
> > >
> > > Hi Trejkaz,
> > >
> > > Negative norm values are legal. The problem here is that Lucene expects
> > > that documents that have no terms must either not have a norm value
> > > (typically because the document doesn't have a value for the field), or a
> > > norm value equal to 0 (typically because the token stream over the field
> > > value produced no tokens).
> > >
> > > Are you using a custom similarity or one of the Lucene ones? One would
> > only
> > > get -1 as a norm with the Lucene similarities if it had a number of
> > tokens
> > > that is very close to Integer.MAX_VALUE.
> > >
> > > On Thu, Jun 11, 2020 at 4:22 AM Trejkaz  wrote:
> > >
> > > > Hi all.
> > > >
> > > > We use CheckIndex as a post-migration sanity check and are seeing this
> > > > quirk, and I'm wondering whether negative norms is even legit or
> > > > whether it should have been treated as if it were zero...
> > > >
> > > > TX
> > > >
> > > >
> > > > 0.00% total deletions; 378 documents; 0 deletions
> > > > Segments file=segments_1 numSegments=1 version=8.5.1
> > > > id=52isly98kogao7j0cnautwknj
> > > >   1 of 1: name=_0 maxDoc=378
> > > > version=8.5.1
> > > > id=52isly98kogao7j0cnautwkni
> > > > codec=Lucene84
> > > > compound=false
> > > > numFiles=18
> > > > size (MB)=0.663
> > > > diagnostics = {java.vendor=Oracle Corporation, os=Mac OS X,
> > > > java.version=1.8.0_191, java.vm.version=25.191-b12,
> > > > lucene.version=8.5.1, os.arch=x86_64,
> > > > java.runtime.version=1.8.0_191-b12, source=addIndexes(CodecReader...),
> > > > os.version=10.15.5, timestamp=1591841756208}
> > > > no deletions
> > > > test: open reader.OK [took 0.004 sec]
> > > > test: check integrity.OK [took 0.002 sec]
> > > > test: check live docs.OK [took 0.000 sec]
> > > > test: field infos.OK [36 fields] [took 0.000 sec]
> > > > test: field norms.OK [26 fields] [took 0.001 sec]
> > > > test: terms, freq, prox...ERROR: java.lang.RuntimeException:
> > > > Document 0 doesn't have terms according to postings but has a norm
> > > > value that is not zero: -1
> > > >
> > > > java.lang.RuntimeException: Document 0 doesn't have terms according to
> >

Re: CheckIndex complaining about -1 for norms value

2020-06-11 Thread Trejkaz
Well,

We're using the default Lucene similarity. But as far as I know, we've
always disabled norms as well. So I'm surprised I'm even seeing norms
mentioned in the context of our own index, which is why I wondered
whether -1 might have been an older placeholder for "no value" which
later became 0 or something.

About the only thing I'm sure about at the moment is that whatever is
going on is weird.

TX

On Thu, 11 Jun 2020 at 15:38, Adrien Grand  wrote:
>
> Hi Trejkaz,
>
> Negative norm values are legal. The problem here is that Lucene expects
> that documents that have no terms must either not have a norm value
> (typically because the document doesn't have a value for the field), or a
> norm value equal to 0 (typically because the token stream over the field
> value produced no tokens).
>
> Are you using a custom similarity or one of the Lucene ones? One would only
> get -1 as a norm with the Lucene similarities if it had a number of tokens
> that is very close to Integer.MAX_VALUE.
>
> On Thu, Jun 11, 2020 at 4:22 AM Trejkaz  wrote:
>
> > Hi all.
> >
> > We use CheckIndex as a post-migration sanity check and are seeing this
> > quirk, and I'm wondering whether negative norms is even legit or
> > whether it should have been treated as if it were zero...
> >
> > TX
> >
> >
> > 0.00% total deletions; 378 documents; 0 deletions
> > Segments file=segments_1 numSegments=1 version=8.5.1
> > id=52isly98kogao7j0cnautwknj
> >   1 of 1: name=_0 maxDoc=378
> > version=8.5.1
> > id=52isly98kogao7j0cnautwkni
> > codec=Lucene84
> > compound=false
> > numFiles=18
> > size (MB)=0.663
> > diagnostics = {java.vendor=Oracle Corporation, os=Mac OS X,
> > java.version=1.8.0_191, java.vm.version=25.191-b12,
> > lucene.version=8.5.1, os.arch=x86_64,
> > java.runtime.version=1.8.0_191-b12, source=addIndexes(CodecReader...),
> > os.version=10.15.5, timestamp=1591841756208}
> > no deletions
> > test: open reader.OK [took 0.004 sec]
> > test: check integrity.OK [took 0.002 sec]
> > test: check live docs.OK [took 0.000 sec]
> > test: field infos.OK [36 fields] [took 0.000 sec]
> > test: field norms.OK [26 fields] [took 0.001 sec]
> > test: terms, freq, prox...ERROR: java.lang.RuntimeException:
> > Document 0 doesn't have terms according to postings but has a norm
> > value that is not zero: -1
> >
> > java.lang.RuntimeException: Document 0 doesn't have terms according to
> > postings but has a norm value that is not zero: -1
> > at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1678)
> > at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1871)
> > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:724)
> > at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)
> >
> > test: stored fields...OK [15935 total field count; avg 42.2
> > fields per doc] [took 0.003 sec]
> > test: term vectorsOK [1173 total term vector count; avg
> > 3.1 term/freq vector fields per doc] [took 0.170 sec]
> > test: docvalues...OK [16 docvalues fields; 11 BINARY; 2
> > NUMERIC; 0 SORTED; 2 SORTED_NUMERIC; 1 SORTED_SET] [took 0.003 sec]
> > test: points..OK [4 fields, 1509 points] [took 0.000 sec]
> > FAILED
> > WARNING: exorciseIndex() would remove reference to this segment;
> > full exception:
> > java.lang.RuntimeException: Term Index test failed
> > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:750)
> > at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)
> >
> > WARNING: 1 broken segments (containing 378 documents) detected
> > Took 0.355 sec total.
> > WARNING: would write new segments file, and 378 documents would be
> > lost, if -exorcise were specified
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien




CheckIndex complaining about -1 for norms value

2020-06-10 Thread Trejkaz
Hi all.

We use CheckIndex as a post-migration sanity check and are seeing this
quirk, and I'm wondering whether negative norms is even legit or
whether it should have been treated as if it were zero...

TX


0.00% total deletions; 378 documents; 0 deletions
Segments file=segments_1 numSegments=1 version=8.5.1
id=52isly98kogao7j0cnautwknj
  1 of 1: name=_0 maxDoc=378
version=8.5.1
id=52isly98kogao7j0cnautwkni
codec=Lucene84
compound=false
numFiles=18
size (MB)=0.663
diagnostics = {java.vendor=Oracle Corporation, os=Mac OS X,
java.version=1.8.0_191, java.vm.version=25.191-b12,
lucene.version=8.5.1, os.arch=x86_64,
java.runtime.version=1.8.0_191-b12, source=addIndexes(CodecReader...),
os.version=10.15.5, timestamp=1591841756208}
no deletions
test: open reader.OK [took 0.004 sec]
test: check integrity.OK [took 0.002 sec]
test: check live docs.OK [took 0.000 sec]
test: field infos.OK [36 fields] [took 0.000 sec]
test: field norms.OK [26 fields] [took 0.001 sec]
test: terms, freq, prox...ERROR: java.lang.RuntimeException:
Document 0 doesn't have terms according to postings but has a norm
value that is not zero: -1

java.lang.RuntimeException: Document 0 doesn't have terms according to
postings but has a norm value that is not zero: -1
at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1678)
at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1871)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:724)
at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)

test: stored fields...OK [15935 total field count; avg 42.2
fields per doc] [took 0.003 sec]
test: term vectorsOK [1173 total term vector count; avg
3.1 term/freq vector fields per doc] [took 0.170 sec]
test: docvalues...OK [16 docvalues fields; 11 BINARY; 2
NUMERIC; 0 SORTED; 2 SORTED_NUMERIC; 1 SORTED_SET] [took 0.003 sec]
test: points..OK [4 fields, 1509 points] [took 0.000 sec]
FAILED
WARNING: exorciseIndex() would remove reference to this segment;
full exception:
java.lang.RuntimeException: Term Index test failed
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:750)
at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)

WARNING: 1 broken segments (containing 378 documents) detected
Took 0.355 sec total.
WARNING: would write new segments file, and 378 documents would be
lost, if -exorcise were specified




Re: StandardFilter Question : https://issues.apache.org/jira/browse/LUCENE-8356

2019-06-25 Thread Trejkaz
Yeah, that code looks right to me.

The factory we use for keeping backwards compatibility is entirely
ours. I think CustomAnalyzer is a similar-looking API to what we have
but we made ours much earlier and it supports analysis stuff all the
way back to Lucene 3 which we migrated all the way to where we are
now.

TX

On Wed, 26 Jun 2019 at 06:47,  wrote:
>
> Corrected a typo below in the new code.
>
> Best regards
>
>
> On 6/25/19 5:01 PM, baris.ka...@oracle.com wrote:
> > Hi,-
> >
> >  do You mean there is a backward compatibility factory in Lucene for
> > these kinds of cases?
> >
> > i think it can be fixed like this,  In other words is the following
> > first line redundant then?
> >
> > TokenStream filter = new StandardFilter(tokenizer); -> redundant
> > (tokenizer is actually a StandardTokenizer object).
> > filter = new ElisionFilter(filter, getDefaultArticles());-> tokenizer
> > can be directly used here
> >
> > filter = new LowerCaseFilter(filter);
> >
> > ->
> >
> > TokenStream filter = new ElisionFilter(*tokenizer*,
> > getDefaultArticles());*//not filter here*
> >
> > filter = new LowerCaseFilter(filter);
> >
> >
> > I also saw that some public fields have now different return type like
> > org.apache.lucene.search.TopDocs.totalHits field which is long type now.
> >
> > this affects my rest of the code very much but luckliy there is
> > Math.toIntExact which throws ArithmeticException when number is really
> > long number outside integer limit.
> >
> > In my case i will not exceed integer limit anyways.
> >
> >
> > Best regards
> >
> >
> > On 6/24/19 5:19 PM, Trejkaz wrote:
> >> I did the research on this one because it confused me as well, but it
> >> seems it was a no-op. So the replacement is just to remove it from the
> >> filter chain.
> >>
> >> We have a backwards compatibility filter factory, so we deal with it
> >> by keeping around a compatibility implementation which just does
> >> nothing like before.
> >>
> >> TX
> >>
> >>
> >> On Tue, 25 Jun 2019 at 06:21,  wrote:
> >>> According to this jira ticket, where else is StandardFilter included in
> >>> Lucene 8.1.1?
> >>>
> >>> and why is it a no-op now in Lucene 8.1.1?
> >>>
> >>> I wish the tickets were a bit more explicit and suggest what to use
> >>> instead for deprecated versions like in version 7.5.0 or why it became
> >>> no-op in version 8.1.1?
> >>>
> >>> this will make it easy when upgrading to later versions.
> >>>
> >>> Thanks
> >>>
> >>>
> >>> -
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
>




Re: StandardFilter Question : https://issues.apache.org/jira/browse/LUCENE-8356

2019-06-24 Thread Trejkaz
I did the research on this one because it confused me as well, but it
seems it was a no-op. So the replacement is just to remove it from the
filter chain.

We have a backwards compatibility filter factory, so we deal with it
by keeping around a compatibility implementation which just does
nothing like before.
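
A no-op compatibility filter amounts to nothing more than this (illustrative sketch;
the class name is made up):

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import java.io.IOException;

    // Stands in for the removed StandardFilter: passes every token through unchanged.
    final class NoOpCompatibilityFilter extends TokenFilter {
        NoOpCompatibilityFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            return input.incrementToken();
        }
    }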

TX


On Tue, 25 Jun 2019 at 06:21,  wrote:
>
> According to this jira ticket, where else is StandardFilter included in
> Lucene 8.1.1?
>
> and why is it a no-op now in Lucene 8.1.1?
>
> I wish the tickets were a bit more explicit and suggest what to use
> instead for deprecated versions like in version 7.5.0 or why it became
> no-op in version 8.1.1?
>
> this will make it easy when upgrading to later versions.
>
> Thanks
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>




Re: IntField to IntPoint

2019-06-05 Thread Trejkaz
How we would do it:

- update the index format to v7 (this in itself is fiddly but there are ways)
- open the in-place-migrated index:
  - get all the leaf indices and wrap each one in a new subclass of FilterCodecReader
  - override getPointsReader() on that subclass to return a correctly implemented
    PointsReader, which can read the data from the stored fields
  - be careful about the order in which you return the points
  - you might want to spool the points to a database like Derby or H2, since if you
    have a lot of data there is a risk of running out of memory
- copy that whole index to a new index using IndexWriter#addIndexes(CodecReader...)

Copying the docs works too if you have the original text stored still, but
we didn’t, so we use this sort of technique for all Lucene migrations.

TX


On Thu, 6 Jun 2019 at 07:07, Riccardo Tasso 
wrote:

> Ok,
>  I know this policy and you perfectly explained why it makes sense.
>
> Anyway my index is really big and contains mostly textual data which are
> expensive to reindex (because of custom analysis).
>
> Considering that the IndexUpgrader will efficiently do the most of the work
> I should investigate how to fill this gap, without reindexing from scratch.
>
>
> The most efficient approach I can figure is:
> * convert from 4 to 7
> * open an index reader and an index writer on the 7 index
> * iterate every document
> * read the numeric field (since it's already stored)
> * add to each document the IntPoint field
> * update the document on the index
>
> I guess the expensive task here is the update, since it will delete and
> readd the document, but in this case I think I will save the analysis
> costs.
>
> Do you think there's a better way of doing this reindex?
>
> Thanks
>
>
> On Wed, 5 Jun 2019 at 17:41, Erick Erickson  wrote:
>
> > You cannot upgrade more than one major version, you must re-index from
> > scratch. There’s a long discussion of why, but basically it’s summed up
> by
> > this quote from Robert Muir:
> >
> > “I think the key issue here is Lucene is an index not a database. Because
> > it is a lossy index and does not retain all of the user's data, its not
> > possible to safely migrate some things automagically. In the norms case
> > IndexWriter needs to re-analyze the text ("re-index") and compute stats
> to
> > get back the value, so it can be re-encoded. The function is y = f(x) and
> > if x is not available its not possible, so lucene can't do it.”
> >
> > This has always been true, before 8x it would just  fail silently as  you
> > have found. Solr/Lucene starts up but don’t  work quite as expected. As
> of
> > Lucene 8x, Lucene (and therefore Solr) will not even open an index that
> > has  _ever_ been touched by Lucene 6x, no matter what intervening steps
> > have been taken. Or in general,  Lucene/Solr X will  not  open indexes
> > touched by X-2, starting with 8x rather than behave unexpectedly.
> >
> > Best,
> > Erick
> >
> > > On Jun 5, 2019, at 8:27 AM, Riccardo Tasso 
> > wrote:
> > >
> > > Hello everybody,
> > > I have a (very big) lucene 4 index with documents using IntField. On
> that
> > > field, which should be stored and sortable, I should search and execute
> > > range queries.
> > >
> > > I've tried to upgrade it from 4 to 7 with IndexUpgrader but I observed
> > that
> > > IntFields aren't searchable anymore.
> > >
> > > Which is the most efficient way to convert IntFields to IntPoints,
> which
> > > are stored and sortable?
> > >
> > > Thanks,
> > > Riccardo
> >
> >
> >
> >
> >
>


Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Trejkaz
On Sun, 26 May 2019 at 23:49, Namgyu Kim  wrote:

> I think so about that approach.
> It's not user-friendly and it is not good for the user.

> I think it's better to get the parameters in JapaneseTokenizer.
>
> What do you think about this?


A way to override the system dictionary would be useful for us as well. We
often get people complaining that the current dictionary is missing a lot
of common modern words, and there are alternate mecab dictionaries sitting
around already which solve this problem.

TX




Re: How can I decode geo point postings?

2019-03-31 Thread Trejkaz
On Mon, Apr 1, 2019 at 5:32 AM David Smiley  wrote:
>
> Yup.  And if you have the original lat/lon then you can forgo the
> complexity of reverse-engineering it from postings.

It has been a long day.

I did manage to reverse engineer it by reversing the stuff in
geoCodedToPrefixCodedBytes - to discover that even the highest
precision value indexed in postings wasn't precise enough to be
acceptable. Then I discovered that doc values had more precision, but
that didn't give me the same value I got with a LatLonPoint either.
(By this point I had dug into a newer index to get out the exact
values the new version of Lucene was putting in, and had my unit test
asserting that the result was exactly the same.)

So in the end I have ended up digging it back out of a stored field
and parsing that. Which was actually harder to parse than the
postings, mind you, but contained the full value, so that's fine. The
only disappointing thing is that it's a little slower than using doc
values or postings would have been. :(

TX




Re: How can I decode geo point postings?

2019-03-27 Thread Trejkaz
On Mon, Mar 11, 2019 at 1:15 PM Trejkaz  wrote:
>
> Hi all.
>
> I'm attempting to migrate from GeoPointField to LatLonPoint so that we
> might have a hope in updating to Lucene 7. The first hurdle I'm
> hitting is while writing the migration code.
>
> I inserted a single document with one geo point in it on Lucene 6.6,
> and when I iterate the postings, I see the following binary terms:
>
> [37 80 0 0 a]
> [37 ac 80 0 13]
> [37 ac 98 0 1c]
>
> 1) The existence of three terms is presumably something to do with
> multiple precision. Is there a safe way to figure out which value I
> want? e.g., is 0x1C always the value that I want?
>
> 2) When I figure out which value I want, where should I be looking for
> the logic to get out the latitude and longitude?

After a couple weeks it appears that nobody knows the answer, but does
anyone at least know where in the Lucene code I should be looking to
find the answer for myself?

TX




How can I decode geo point postings?

2019-03-10 Thread Trejkaz
Hi all.

I'm attempting to migrate from GeoPointField to LatLonPoint so that we
might have a hope in updating to Lucene 7. The first hurdle I'm
hitting is while writing the migration code.

I inserted a single document with one geo point in it on Lucene 6.6,
and when I iterate the postings, I see the following binary terms:

[37 80 0 0 a]
[37 ac 80 0 13]
[37 ac 98 0 1c]

1) The existence of three terms is presumably something to do with
multiple precision. Is there a safe way to figure out which value I
want? e.g., is 0x1C always the value that I want?

2) When I figure out which value I want, where should I be looking for
the logic to get out the latitude and longitude?

Once I have the latitude and longitude from the old data, it seems
like a relatively simple task to get it back into the points field.

TX




Re: Lucene sort in AaZz order

2019-01-15 Thread Trejkaz
On Wed, Jan 16, 2019 at 2:29 AM Adrien Grand  wrote:
>
> Assuming that you need case-insensitive sort, the most straightforward
> way to do this would be to index the lowercase family name:
> SortedDocValuesField("by_name", new
> BytesRef(family.getName().toLowerCase(Locale.ROOT))).
>
> It is also possible to implement a custom FieldComparatorSource, but
> this will likely be both more complicated and slower.

Probably actually want to use
toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT), and possibly even a
Normalizer.normalize before that. Unless you can use ICU's normaliser
with built-in case folding, which simplifies it a lot.
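
Combined with the SortedDocValuesField suggestion above, the indexing side is roughly
(sketch only; "by_name" comes from earlier in the thread, the rest is assumed):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.util.BytesRef;
    import java.text.Normalizer;
    import java.util.Locale;

    final class SortKeys {
        static void addNameSortKey(Document doc, String familyName) {
            // Normalise, then fold case via an upper/lower round trip.
            String folded = Normalizer.normalize(familyName, Normalizer.Form.NFKC)
                    .toUpperCase(Locale.ROOT)
                    .toLowerCase(Locale.ROOT);
            doc.add(new SortedDocValuesField("by_name", new BytesRef(folded)));
        }
    }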

TX




Re: Efficient way to define large Boolean Occur.FILTER clause in Lucene 6

2018-06-26 Thread Trejkaz
On Tue, Jun 26, 2018 at 7:02 PM, Hasenberger, Josef
 wrote:
> However, I have a feeling that the conversion from Long values to Terms is
> rather inefficient for large collections and also uses a lot of memory.
> To ease conversion overhead somewhat, I created a class that converts a
> Long value directly to BytesRef instance (in order to avoid conversion to
> UTF16 and then UTF8 again) and pass that instance to the Term constructor.

First thought is, why are you using TermsQuery if they're in DocValues?
Is DocValuesTermsQuery any better? It does depend on how many terms
you're searching for.

Second thought is that there is also DocValuesNumbersQuery, which
avoids having to convert all the values.

> I just wonder if there is a better method for passing large amount of filter 
> criteria
> to a BooleanQuery Occur.FILTER clause, that avoids excessive object creation.

If you can get your long values into something which implements Bits,
you could make a query using RandomAccessWeight to directly point at
the existing set you already have in memory.

TX




Re: Recommendation for doing a search plus collecting extra information?

2018-03-26 Thread Trejkaz
On Mon, Oct 12, 2015 at 4:32 AM, Alan Woodward <a...@flax.co.uk> wrote:
> Hi Trejkaz,
>
> You can still use a standard collector if you don’t need to worry about 
> multi-threaded search.
> It sounds as though what you want to do is implement your own Collector that 
> will read and
> record docvalues hits, and use MultiCollector to wrap it and a standard 
> TopDocsCollector together.

This is what I'm currently trying out, but I'm hitting exactly the
problem I predicted. To use the values, I have to put them into some
kind of storage.

I can put them into an int[] but then it's the worst case memory usage
for queries returning a small number of hits.

Or I can put them into something like a fastutil Int2IntOpenHashMap,
which reduces the memory usage for small queries, while also making
large queries much slower.

Neither of these is really appealing right now.
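
For reference, the rough shape of the side collector being described (not the actual
code from this system; written against the newer iterator-style doc values API, with
the field name "id" assumed):

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    final class IdCollector extends SimpleCollector {
        private final List<Long> ids = new ArrayList<>();   // the storage problem lives here
        private NumericDocValues values;

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            values = DocValues.getNumeric(context.reader(), "id");
        }

        @Override
        public void collect(int doc) throws IOException {
            if (values.advanceExact(doc)) {
                ids.add(values.longValue());
            }
        }

        @Override
        public ScoreMode scoreMode() {
            return ScoreMode.COMPLETE_NO_SCORES;
        }

        List<Long> ids() { return ids; }
    }

    // usage, per Alan's suggestion:
    //   searcher.search(query, MultiCollector.wrap(topDocsCollector, new IdCollector()));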

Two ideas but I can't figure out if they'll work:

1. The doc IDs are visited in order, at least within each segment. Is
there a structure in Lucene itself somewhere which can store that off
quickly and efficiently?

2. Am I allowed to just hold onto the NumericDocValues for each leaf
and hold onto them for a long period of time, or is there an
implementation of them which breaks that? I figure it's already
sitting around, so that should be zero additional storage?

TX




Re: Recommendation for doing a search plus collecting extra information?

2018-03-05 Thread Trejkaz
I did some experiments.

As it turns out, changing SortedNumericSortField to SortField had no
effect on the timings at all.
However, changing the SortField.Type from LONG to INT makes queries
come back 3 times faster.
(20ms vs. 6.5ms comparing the fastest runs for each.)

Why would using int be 3 times faster, and not 2?

(And repeating from the last mail, is there a way to use less memory?)

TX




Re: Recommendation for doing a search plus collecting extra information?

2018-02-28 Thread Trejkaz
On Mon, Oct 12, 2015 at 3:28 PM, Uwe Schindler  wrote:
> Hi,
>
> it may sound a bit stupid, but you can do the following:
>
> If you search for a docvalues (previously fieldcache) field in lucene, the 
> returned TopFieldDocs contains also the field values
> that were sorted against. The ScoreDoc instances in this collection are 
> actually FieldDoc instances (cast them down):
> https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/FieldDoc.html
>
> So my suggestion would be: sort primarily against score (SortField.SCORE), 
> but add a secondary sort field with the docvalues
> field you want to be part of your results. The results will be primarily 
> sorted against the score so you should still get the results
> in right order, but you can have the docvalues field as part of your 
> TopFieldDocs
> (https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/search/TopFieldDocs.html)
>  collections after downcasting
> the ScoreDoc to Fieldoc (the sorted fields are saved as Object[] instances). 
> Choose the second FieldDoc field and cast
> it to your data type.

Well, this solution was working fine for a long time, but now we have
some users crying about the additional memory usage.

We're using this sort field:

private static final SortedNumericSortField ID_SORT_FIELD =
new SortedNumericSortField(LuceneFields.ID.getName(),
SortField.Type.LONG);

Since the field isn't sorted anything I think I can now change it to just:

private static final SortField ID_SORT_FIELD =
new SortField(LuceneFields.ID.getName(), SortField.Type.LONG);

Either way it ends up creating a LongComparator, though, which seems
to be what is being complained about. The memory usage of
LongComparator seems totally fine to me and it's using what seems to
be the minimum storage for what it's doing, so it's not like it can be
improved, but maybe there is a way to make a comparator that doesn't
have to store a copy of the data in memory?
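
For reference, the approach quoted above looks roughly like this in sketch form
(field name "id" assumed):

    import org.apache.lucene.search.*;
    import java.io.IOException;

    final class ScoredIdSearch {
        static long[] searchAndReadIds(IndexSearcher searcher, Query query) throws IOException {
            Sort sort = new Sort(SortField.FIELD_SCORE,
                                 new SortedNumericSortField("id", SortField.Type.LONG));
            TopFieldDocs top = searcher.search(query, 100, sort);
            long[] ids = new long[top.scoreDocs.length];
            for (int i = 0; i < top.scoreDocs.length; i++) {
                FieldDoc fd = (FieldDoc) top.scoreDocs[i];      // safe: we searched with a Sort
                ids[i] = ((Number) fd.fields[1]).longValue();   // value of the secondary sort field
            }
            return ids;
        }
    }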

TX




Re: Lucene with Database

2017-12-27 Thread Trejkaz
On Thu, Dec 28, 2017 at 1:07 AM, Riccardo Tasso
 wrote:
> Hi,
> I am not aware of any lucene integration with rdbms

Derby has a plugin of some sort. I haven't tried it so I have no idea
what it actually does, but it looks like it adds table functions which
you could join to other queries.

https://db.apache.org/derby/docs/10.13/tools/rtoolsoptlucene.html

TX




Re: UnsupportedOperationException from Outputs.merge, during addIndexes

2017-12-11 Thread Trejkaz
On Mon, Dec 11, 2017 at 10:59 PM, Adrien Grand  wrote:
> This means the FST builder is fed twice with the same key, so it tries to
> merge their outputs. This should not happen since the terms dictionary
> deduplicates terms.
>
> Do you get additional errors if you enable assertions? What are the codec
> readers that you pass to addIndexes? Could they contain duplicate terms?

This hint is a good lead, I'll start by checking all our own reader
implementations to see whether any of them could return the same term
more than once. The index I have been given might be broken somehow
too, but we're also migrating the data by creating "fake" codec
readers for things like postings, so it could be literally anywhere at
this point.
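
One cheap sanity check for a suspect Terms implementation is to walk its TermsEnum
and confirm the terms come back strictly increasing, something like this sketch:

    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import java.io.IOException;

    final class TermOrderCheck {
        static void check(Terms terms) throws IOException {
            TermsEnum it = terms.iterator();
            BytesRef prev = null;
            BytesRef term;
            while ((term = it.next()) != null) {
                if (prev != null && prev.compareTo(term) >= 0) {
                    throw new IllegalStateException("duplicate or out-of-order term: "
                            + term.utf8ToString() + " after " + prev.utf8ToString());
                }
                prev = BytesRef.deepCopyOf(term);   // the enum may reuse its BytesRef
            }
        }
    }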

Turns out I still don't get to find out which field did it yet either,
because the most suspicious field didn't trigger it when migrated by
itself, and my overnight attempt died for other reasons. :)

TX




UnsupportedOperationException from Outputs.merge, during addIndexes

2017-12-10 Thread Trejkaz
Hi all.

I have an addIndexes call which in my over-weekend run threw an
UnsupportedOperationException from deep inside Lucene's code.

I'm wondering what sort of condition this is expected to occur in. The
source postings it's writing might be corrupt in some way, and if I
figure out what way it's corrupt, I can try to work around it.

I'm sure it's reproducible so I've put a breakpoint on the spot
already, but it takes a long time to get there.

Version in use is Lucene 6.6.0.

TX


java.lang.UnsupportedOperationException
at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:97)
at org.apache.lucene.util.fst.Builder.add(Builder.java:459)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingBlock.append(BlockTreeTermsWriter.java:503)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:475)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:635)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:907)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:871)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2824)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 6.1.0 index upgrade

2017-11-10 Thread Trejkaz
On Sat, Nov 11, 2017 at 7:09 AM, Krishnamurthy, Kannan
 wrote:
> Never mind my previous question, understood what you meant about the impact 
> to norms after
> looking at the uses of CreatedMajorVersion in various Similarity classes. It 
> almost looks like
> re-indexing is the only option here. This will be quite a big change to 
> embrace for the users
> of addIndexes().

Reindexing isn't even an option for some of us, because we don't have
the original content which was indexed.

So personally I'm hoping that IndexUpgrader continues to work. :)
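
(For anyone in the same position, the upgrade call itself is tiny; a sketch, with
the path as a placeholder, and it has to be run once per major version hop using
that version's lucene-core:)

    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
        // Rewrites every segment into the current index format.
        new IndexUpgrader(dir).upgrade();
    }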

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to fetch documents for which field is not defined

2017-07-16 Thread Trejkaz
On Sat, Jul 15, 2017 at 8:12 PM, Uwe Schindler  wrote:
> That is the "Solr" answer. But it is slow like hell.
>
> In Lucene there is a native query named FieldValueQuery already for this.
> It requires DocValues enabled for the field.
>
> IMHO, the best and fastest variant (also to Solr users) is to add a separate
> multivalued string field named 'fieldnames' where you index all field names
> that have a value. After that you can query on this using the field name.
> Elasticsearch is doing the field name approach for exists/not exists by 
> default.

The catch is, you usually have to analyse a field to determine whether
it has a value. Apparently Elasticsearch's field existence query does
not do this, so it considers blank text to be a value, which is not
the same as what the user expected when they did the query.

We *were* using FieldValueQuery, but since moving to Lucene 6 we have
stopped using uninverting reader, so that option doesn't cover all
fields, and fields like "content" aren't really practical to put in
DocValues...

The approach to add a fieldnames field works, but is fiddly at
indexing-time, because now you have to use TokenStream for all fields,
so that you can read one token from each field to test whether there
is one before you add the whole document. I guess it's at least easier
to understand how it works at query-time.
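
Ignoring that analysed-to-nothing wrinkle, the basic shape of it is something
like this (field names and the titleText variable are just examples):

    // Index time: record which fields actually carry a value for this document.
    Document doc = new Document();
    doc.add(new TextField("title", titleText, Field.Store.NO));
    doc.add(new StringField("fieldnames", "title", Field.Store.NO)); // add once per populated field

    // Query time: "field exists" is a plain TermQuery...
    Query exists = new TermQuery(new Term("fieldnames", "title"));
    // ...and "field missing" is everything minus that.
    Query missing = new BooleanQuery.Builder()
        .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
        .add(exists, BooleanClause.Occur.MUST_NOT)
        .build();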

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValue update methods don't appear to throw exception if the document doesn't exist

2017-07-07 Thread Trejkaz
On Thu, Jul 6, 2017 at 8:28 PM, Joe Ye  wrote:
> Thanks very much TX!
>
> Regarding "But the updates don't actually occur during the call", could you
> elaborate on this a bit more? So when would the actual update occur, by
> which I mean persisting to disk?

The same as any other updates - when you call commit().

> Is there a cache of a number of docValues updates before committing to disk?

Someone who knows the internals better would probably have to answer
this one. I don't know how merging of updates to doc values works. (I
am guessing the doc values generation come into play here?) But
flushing to disk and committing changes are two different things
anyway. Lucene will periodically flush changes to disk when it decides
that it can't keep more in memory. This is not the same as it actually
committing, which makes the changes visible to newly-opened readers.
You just end up with files on disk which aren't referenced from the
segments file yet.
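
So, roughly (the term, field and value here are placeholders):

    // The update is only buffered inside the IndexWriter when this returns...
    writer.updateNumericDocValue(new Term("id", "42"), "viewCount", 123L);
    // ...and only becomes durable and visible to newly-opened readers here.
    writer.commit();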

> If so, what happens if a crash occurs before those updates are committed?

Hopefully none of the updates occur, but the science hasn't been done.
(It seems like a fairly easy experiment to do if you really want to
test it.)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocValue update methods don't appear to throw exception if the document doesn't exist

2017-07-04 Thread Trejkaz
On Tue, 4 Jul 2017 at 22:39, Joe Ye  wrote:

> Hi,
>
> I'm using Lucene core 6.6.
>
> I noticed an issue that DocValue update methods
> (indexWriter.updateNumericDocValue
> & indexWriter.updateBinaryDocValue) don't appear to throw exception or
> return any error code if the document doesn't exist. Is this intentional? I
> don't want to check the existence of the document before each docValue
> update.


Given that they take Term or Query, that's what one would intuitively
expect from such an API (it will match 0..n docs.)

But the updates don't actually occur during the call, so there is no way
for it to know how many updates will happen in advance anyway. (Otherwise
it would be nice to know the number of docs it updated, like in JDBC.)

TX


Re: Ways to store and search tens of billions of text document content in one lucene index

2017-06-23 Thread Trejkaz
On Fri, Jun 23, 2017 at 4:24 PM, Ranganath B N  wrote:
> Hi,

[cutting X-Y problem stuff]

> What strategies do you recommend  for this task  "Ways to store  and search  
> tens of billions
> of  text document content in one lucene index"?  so that I can accomplish 
> this in optimal time.

Split it into multiple indexes and don't tell whoever asked you to put
it in one index, or change the definition of "index" for your
application so that the resulting multiple Lucene indices is still
"one index".

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Improving Performance by Combining Multiple Fields into Single Field

2017-06-22 Thread Trejkaz
On Thu, Jun 22, 2017 at 3:23 PM, aravinth thangasami
 wrote:
> Hi,
> Reading through the web: how Elasticsearch's *_source* field stores the
> entire document and uses *_source* for field retrieval.
> Is it better than *document.get* or loading the entire
> *indexreader.document*?

I'd assume that it's worse, but in the case of Elasticsearch, since
they already wanted to store the entire source document for other
reasons, storing both the source document *and* the stored fields was
wasteful.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Does forceMerge(1) not always merge to one segment?

2017-05-21 Thread Trejkaz
On Mon, May 22, 2017 at 3:36 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi Trejkaz,
>
> yes, it calls forceMerge, but this is just a "trick" to look at each segment 
> while merging. But finally it
> decides on the version number of each segment, if it gets merged as part of 
> the forceMerge(1). If the
> version number of an segment is already on the latest version (because the 
> index was already used
> with 4.10 and new documents were added/updated), it will just remove it from 
> the list of the segs to
> merge. It looks like the index in 4.10.4 already has a lot of segments on 
> 4.10 version. When you
> upgrade it directly after that to 5.5.2, of course it has to merge all 
> segments, as all segs are "only
> on 4.10".
>
> The whole trick is in the IndexUpgraderMergePolicy. This one implements the 
> algorithm above.

So a MergePolicy can override the desire of the caller to have a
single segment? I thought the caller would ultimately get the power of
veto, but I guess not. :)

It turns out we were relying on this for a later migration, but I
guess it might be easier to somehow make it work regardless of the
number of segments which came out the end. Because all indexes before
this were created on v3 and the migration goes all the way to v5, it
took quite a while to run into an index which didn't end up as a
single segment after both passes... (I guess it would have to have
over 100 segments?)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Does forceMerge(1) not always merge to one segment?

2017-05-21 Thread Trejkaz
We're using IndexUpgrader to upgrade indexes.

The 4.10.4 version of this appears to be implemented with a call to
forceMerge(1). But when I look at the result for one particular index
here, I see that it has 11 segments after doing the merge.

When the 5.5.2 version was then run against the same index, it brought
it down to 2 ... but it seems to have been implemented using
forceMerge(1) too.

What is going on here?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: will lucene traverse all segments to search a 'primary key'term or will it stop as soon as it get one?

2017-04-20 Thread Trejkaz
On Fri, Apr 21, 2017 at 1:09 PM, 马可阳  wrote:
> Let’s say I have a user info index and user id is the ‘primary key’. So when 
> I do a userid term search,
> will lucene traverse all segments to search a 'primary key'term or will it 
> stop as soon as it get one?
>
> If it is the latter one, will any plan to make it the former way?

Thoughts:

1) What happens if you limit the result count to 1 *and* turn scoring off?

2) If that didn't work, a custom collector which terminates on finding
a single hit sounds easy enough to write.
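
Something like this, as a sketch (CollectionTerminatedException is what lets the
searcher move on without treating it as an error):

    // Collects the first matching doc and then refuses to do any further work.
    class FirstHitCollector extends SimpleCollector {
        private int docBase;
        private int firstHit = -1;

        @Override
        protected void doSetNextReader(LeafReaderContext context) {
            if (firstHit != -1) {
                throw new CollectionTerminatedException(); // already found one; skip remaining segments
            }
            docBase = context.docBase;
        }

        @Override
        public void collect(int doc) {
            firstHit = docBase + doc;
            throw new CollectionTerminatedException(); // stop scoring this segment
        }

        @Override
        public boolean needsScores() {
            return false;
        }

        int firstHit() {
            return firstHit; // -1 if nothing matched
        }
    }

Then searcher.search(new TermQuery(new Term("userid", id)), collector) and read
firstHit() afterwards.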

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



QueryNode / query parser performance

2017-04-12 Thread Trejkaz
So...

I know none of this work is possible to contribute back to Lucene
because the API I've ended up with is too different, but I thought I
would share anyway.

For a query with 10,000 terms:

Before any changes: ~7s

Change 1: Change QueryNodeImpl to hold an immutable list of children
and only copy the children when changes are made.
   New time: ~4s

Change 2: Change QueryNode/QueryNodeImpl to get rid of getParent() so
that it doesn't have to update every time you change the hierarchy.
   New time: ~100ms

Change 3: Change QueryNode itself to be mostly-immutable (tags are
tricky and not done yet), so that trees of nodes don't have to be
cloned.
   New time: ~80ms

The next ones on the list...

QueryParserLexer$DFA24.specialStateTransition()
    31.157015    11,736 ms (31.2%)    11,736 ms    11,736 ms    11,736 ms
TokenStream.assertFinal()
    17.665968     6,654 ms (17.7%)     6,654 ms     6,654 ms     6,654 ms
QueryNodeProcessorImpl.processIteration()
    12.397131     4,669 ms (12.4%)     4,669 ms    19,552 ms    19,552 ms

The parser one I probably can't do much about unless a newer version
of ANTLR is significantly faster, but that assertFinal() is
interesting. I guess this method is fairly expensive, and
AnalyzerQueryNodeProcessor is creating a new one over and over again?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Weird cloning in QueryNode implementations

2017-04-10 Thread Trejkaz
Hi all.

Something queer I found while looking at QueryNode implementations is
this sort of thing:

@Override
public FieldQueryNode cloneTree() throws CloneNotSupportedException {
    FieldQueryNode fqn = (FieldQueryNode) super.cloneTree();
    fqn.begin = this.begin;
    fqn.end = this.end;
    fqn.field = this.field;
    fqn.text = this.text;
    fqn.positionIncrement = this.positionIncrement;

    return fqn;
}

I guess what I don't get is the point of all these field copies, because:

* super.cloneTree() seems to call QueryNodeImpl#cloneTree()
* QueryNodeImpl#cloneTree seems to call Object#clone()
* Object#clone() seems to already copy all primitives and
references (a "shallow copy")

So to me it looks like the query node classes are doing a lot of
pedalling, but that the gears aren't really connected to the wheels,
and that classes should only override these if they have a more
complex object which isn't already taken care of automatically.

Or am I missing something?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there some sensible way to do giant BooleanQuery or similar lazily?

2017-04-03 Thread Trejkaz
On Mon, Apr 3, 2017 at 6:25 PM, Adrien Grand  wrote:
> Large boolean queries can cause a lot of random access as each sub clause
> is advanced one after the other. Even in the case that everything fits in
> the filesystem cache, the fact that the heap needs to be rebalanced after
> each documents makes queries on many clauses slow. This is why we have
> TermInSetQuery (TermsQuery on 6.x): it has a more disk-friendly access
> pattern (1 seek per term per segment) and scales better with the number of
> terms. Unfortunately it does not only come with benefits and its main
> drawback is that it is always evaluated againts the entire index. So if you
> intersect a very selective query (on an id field for instance) with a large
> TermInSetQuery, the TermInSetQuery will dominate the execution time for
> sure.

One such case which we do have is searching on file digests, where all
the values are spread across the entire index, and the common prefixes
don't allow much of a win from things like automata. For those,
though, TermsQuery might still work.
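
e.g. something along these lines (assuming a version where TermInSetQuery is in
core; on older 6.x the TermsQuery constructor from the queries module looks the
same, and digestStrings stands in for wherever the digests come from):

    // One disk-friendly query over the whole digest list, instead of thousands of clauses.
    List<BytesRef> digests = new ArrayList<>();
    for (String hex : digestStrings) {
        digests.add(new BytesRef(hex));
    }
    Query query = new TermInSetQuery("md5", digests);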

The problem is more things like word lists, where one "word" might
analyse to multiple terms, making a phrase query - which prevents
using TermsQuery. Collapsing it to some kind of conditional
multi-phrase query... yeah, I have no idea whether there is any
sensible way to do it.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Is there some sensible way to do giant BooleanQuery or similar lazily?

2017-04-02 Thread Trejkaz
Hi all.

We have this one kind of query where you essentially specify a text
file which contains the actual query to search for. The catch is that
the text file can be large.

Our custom query currently computes the set of matching docs up-front,
and then when queries come in for one LeafReader, the larger doc ID
set is sliced so that the sub-slice for that leaf is returned. Which
is confusing, and seems backwards.

As an alternative, we could override rewrite(IndexReader) and return a
gigantic boolean query. Problems being:

  1) A gigantic BooleanQuery takes up a lot more memory than a list of
query strings.

  2) Lucene devs often say that gigantic boolean queries are bad,
maybe for reason #1, or maybe for another reason which nobody
understands.
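
For reference, the gigantic-BooleanQuery alternative boils down to something like
this, whether it's built eagerly or inside rewrite() (file handling and field name
are placeholders, and BooleanQuery's max clause count would have to be raised for
big files):

    // Build one SHOULD clause per line of the word list.
    static Query buildWordListQuery(Path wordFile, String field) throws IOException {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String line : Files.readAllLines(wordFile, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                builder.add(new TermQuery(new Term(field, word)), BooleanClause.Occur.SHOULD);
            }
        }
        return builder.build();
    }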

So in place of this, is there some kind of alternative?

For instance, is there some query type where I can provide an iterator
of sub-queries, so that they don't all have to be in memory at once?
The code to get each sub-query is always relatively straight-forward
and easy to understand.

I guess the snag is that sometimes the line of text is natural
language which gets run through an analyser, so we'd potentially be
re-analysing the text once per leaf reader? :/

This would replace about 1/3 of the remaining places where we have to
compute the doc ID set up-front.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index error

2017-03-30 Thread Trejkaz
What if totalHits > 1?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

2017-03-09 Thread Trejkaz
On Fri, 10 Mar 2017 at 01:19, Erick Erickson 
wrote:

> There has never been a guarantee that going back and forth between a
> parsed query and its string representation is idempotent. so this
> isn't supported.


Maybe delete the toQueryString method...

There is a fundamental design problem with it anyway, in that it produces a
syntax which isn't necessarily the one you parsed in the first place. We
ended up making a whole family of QuerySyntaxFormatter for all node classes
and had it produce exactly the syntax we consider the cleanest. (Still not
what the user typed in, but aiming to be better when the two differ.)

Although in this case, it does seem like it could have moved the field
outside the brackets to avoid this problem...

TX


Re: Grouping in Lucene queries giving unexpected results

2017-02-16 Thread Trejkaz
On Fri, Feb 17, 2017 at 11:14 AM, Erick Erickson
 wrote:
> Lucene query logic is not strict Boolean logic, the article above explains 
> why.

tl;dr it mostly comes down to scoring and syntax.

The scoring argument will depend on how much you care. (My care for
scoring is pretty close to zero, as I don't care whether the better
results come first, as long as the exact results come back and the
non-results don't.)

For the syntax:

* The article doesn't really address the (-NOT) problem, where
essentially Lucene could insert an implicit *:* when there isn't one,
to make those queries at least get a sane result. You can work around
this by customising the query parser, possible both for the
classic one (subclass it and override the method to create the
BooleanQuery) and the flexible one (add a processor to the pipeline);
a rough sketch of the same fix-up done after parsing follows this list.

* The article strongly encourages using the +/- syntax instead of
AND/OR/NOT, but the astute might notice that AND/OR/NOT is three
operators, whereas +/- is only two, so clearly one of the boolean
clause types does not have a prefix operator, making it literally
impossible to specify some queries using the prefix operators alone.
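
The post-parse version of the (-NOT) fix-up mentioned in the first point is
roughly this (it only handles the top level; nested purely-negative clauses
would need the same treatment recursively):

    // If a BooleanQuery ends up with only MUST_NOT clauses, give it something
    // positive to subtract from.
    static Query fixPureNegative(Query query) {
        if (query instanceof BooleanQuery) {
            BooleanQuery bq = (BooleanQuery) query;
            boolean hasPositive = false;
            for (BooleanClause clause : bq.clauses()) {
                if (clause.getOccur() != BooleanClause.Occur.MUST_NOT) {
                    hasPositive = true;
                    break;
                }
            }
            if (!hasPositive) {
                BooleanQuery.Builder builder = new BooleanQuery.Builder();
                builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
                for (BooleanClause clause : bq.clauses()) {
                    builder.add(clause);
                }
                return builder.build();
            }
        }
        return query;
    }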

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grouping in Lucene queries giving unexpected results

2017-02-16 Thread Trejkaz
On Fri, Feb 17, 2017 at 5:42 AM, Michael Peterson  wrote:
> I have a question about the meaning and behavior of grouping behavior with
> Lucene queries.

For this query:

host:host_1 AND (NOT location:location_5)

The right hand side is:

NOT location:location_5

Which matches nothing, as it has no positive clauses. And, of course,
ANDing that with any other query results in matching nothing.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How do I write in 3.x format to an upgradeded index using Lucene 4.10

2017-01-31 Thread Trejkaz
> If we take our old 3.x index and apply IndexUpgrader to it, we end up with a 
> 4.10 index.
> There are several lucene 4.x files created in the index directory and no 
> errors are thrown.
> However, it appears that the index data is still in the 3.x format, namely it 
> remains:
> "thanks", "coming"
> and not:
> "thanks", , "coming"

Well, this is a different thing really. The index is in the 4.x
format, but the analysis which was performed remains the 3.x analysis,
because nothing was done to change the postings.

So this whole thing is really just a "make sure to use the same
analyser to query which you used to index" problem. So if you indexed
using a Lucene 3 analyser, then you should be using the same v3
analyser when you query against the index in Lucene 4.

So the usual rules apply:
  * Beware of Version.LATEST/LUCENE_CURRENT. Always use the exact
version, and keep using it.
  * If Lucene remove support for some Version you were using, don't
update the Version you're using. Instead, take a copy of the
Tokenizer/TokenFilter you were using from the older version and port
it to work on the new version. Maintain these frozen off analysis
components forever.

But that said, we didn't experience any problems like this from 3 to
4, but rather obscure problems where backwards compatibility was not
maintained in Lucene itself, e.g. places where despite passing in a
Version object, the older behaviour was not maintained. IIRC, the term
length limits being changed was one of these. And in these situations,
for the most part, freezing off a copy of the old behaviour works
fine.

That said, we don't use the "classic" query parser, but rather the
flexible one. And maybe if you're using the classic one, it might have
some misbehaviour around this which we didn't strike by using the
flexible one.

> So we need a way to write documents in 3.x format (no ), to our upgraded 
> indexes,
> new indexes can use native 4.10 format.

It sounds like you just need to use the same analyser you were
previously using, possibly forever...

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [Deep Esoterica] How do point codecs work?

2017-01-24 Thread Trejkaz
On Tue, Jan 24, 2017 at 10:21 PM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Hi Trejkaz,
>
> A normal codec would call visitor.compare on smaller and smaller cells
> (1D ranges for the 1D case) of the byte[] space and depending on that
> result would call one of the visit methods, or skip that cell if
> compare returned CELL_OUTSIDE_QUERY.
>
> I think for your case (to migrate all values) you could simply call
> visitor.visit(int docID, byte[] packedValue) for every value in the
> segment.

This does appear to be working. Although admittedly I don't have any
very good tests yet. :)
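
A sketch of the "visit everything" approach, for anyone curious (not the exact
code; a single-dimension long field, where reader stands in for the LeafReader
being migrated and missing-value handling is only roughed in):

    @Override
    public void intersect(String fieldName, IntersectVisitor visitor) throws IOException {
        NumericDocValues values = reader.getNumericDocValues(fieldName);
        Bits docsWithField = reader.getDocsWithField(fieldName);
        if (values == null) {
            return;
        }
        byte[] packed = new byte[Long.BYTES];
        for (int docID = 0; docID < reader.maxDoc(); docID++) {
            if (docsWithField != null && !docsWithField.get(docID)) {
                continue; // doc has no value for this field
            }
            LongPoint.encodeDimension(values.get(docID), packed, 0);
            visitor.visit(docID, packed); // every value is treated as matching
        }
    }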

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



[Deep Esoterica] How do point codecs work?

2017-01-23 Thread Trejkaz
Hi all.

I'm, considering writing a migration to copy existing doc values into
points (after which I will discard their postings). So essentially I
have to implement three things:

public void intersect(String fieldName, IntersectVisitor visitor) throws IOException
public byte[] getMinPackedValue(String fieldName) throws IOException
public byte[] getMaxPackedValue(String fieldName) throws IOException

It looks like LongPoint.encodeDimension is the encoding I want to
convert the values themselves.

The docs on this "intersect" thing are talking about a query... I
assume the visitor itself is conceptually the query in this context...

...but is the right thing to do here just to call visit() (ah, but
which one?) for every value? In a particular order? ;)   Or is there
something even more tricky I'm supposed to do?


TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Where did earthDiameter go?

2017-01-20 Thread Trejkaz
On Wed, Jan 18, 2017 at 5:43 AM, Adrien Grand  wrote:
> I think the reason why there was no deprecation notice is that this code
> was considered as internal code rather than something that we explicitly
> expose to users as an API.

Hmm...

http://lucene.apache.org/core/5_5_3/core/org/apache/lucene/util/SloppyMath.html

Public class, public methods, no @lucene.internal Javadoc tag... Looks like
public API to me. Probably looked like public API to our other developer who
added the call, too. Probably should at least slap the tag on the docs if we're
not supposed to use it. I'm sure we have calls to the other methods, too. :/

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-17 Thread Trejkaz
On Wed, Jan 18, 2017 at 6:07 AM, Adrien Grand  wrote:
>
> We are open to feedback, what issues are you having with
> ConstantScoreWeight? It is true that it does not bring much compared to
> Weight anymore now that we removed query normalization. The only useful
> thing it has is the default explain() implementation.

I guess I just don't have any good examples to follow for how to
implement it, and
Filter itself, for example, wasn't using it either. Plus there was additional
convenience in not having to make a Query subclass...

At the moment I have pulled solr's copy of it and rolled the code from
that up into
the next abstract class up which we had.

Actually, I'm looking at the direct subclasses of that as well, and it
seems like
there are two common cases:

  1) Queries where we get the DocIdSet from some external source like a
 database, which we could possibly switch to some kind of numeric values
  / point set query (a sketch follows below this list).

  2) Queries where we do something like a TermsQuery but without keeping all
 the terms in memory at the same time... which there might be another way
 to do, but I'm not really sure.
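
For case (1), the point-set version is about this small (the field name is a
placeholder, the field has to have been indexed as a LongPoint, and
loadIdsSomehow() is a made-up stand-in for the database fetch):

    long[] externalIds = loadIdsSomehow();   // hypothetical: IDs from the external source
    Query query = LongPoint.newSetQuery("externalId", externalIds);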

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weird corruption symptom, not making sense

2017-01-16 Thread Trejkaz
On Tue, Jan 17, 2017 at 9:31 AM, Uwe Schindler  wrote:
> ...or a JVM bug. We have seen those around PagedBytes in the past. What Java 
> version?

Actually I just did a bit more digging and found this:

https://issues.apache.org/jira/browse/LUCENE-6948

And of course, even though we're on a later version, our copy of
UninvertingReader was cloned from before that fix.  *facepalm*

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Weird corruption symptom, not making sense

2017-01-16 Thread Trejkaz
I have this thing where our UninvertingReader is getting an
ArrayIndexOutOfBoundsException in production. I'm sure the index is
corrupt, but I tried investigating the code and it still seems a bit
odd.

Caused by: java.lang.ArrayIndexOutOfBoundsException: -48116
at org.apache.lucene.util.PagedBytes$Reader.fill(PagedBytes.java:118)
at OurFieldCacheImpl.BinaryDocValuesImpl.get(SourceFile:844)

In BinaryDocValuesImpl :

return new BinaryDocValues()
{
    @Override
    public BytesRef get(int docID)
    {
        int pointer = (int) docToOffset.get(docID);
        if (pointer == 0) {
            term.length = 0;
        } else {
            bytes.fill(term, pointer);
        }
        return term;
    }
};

So "pointer" is the negative value presumably? Implying that somehow a
negative value got into the docToOffset mappings.
The value it's putting in comes from:

long pointer = bytes.copyUsingLengthPrefix(term);
postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
while (true) {
    int docID = postingsEnum.nextDoc();
    if (docID == DocIdSetIterator.NO_MORE_DOCS) {
        break;
    }
    docToOffset.set(docID, pointer);
}

So it seems like bytes.copyUsingLengthPrefix can return a negative
value? But I looked in PagedBytes and couldn't see an obvious way to
get a negative value.

Is it possible some kind of overflow is happening here?
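
For what it's worth, the specific value is consistent with an int-narrowing
overflow:

    // 4,294,919,180 == 2^32 - 48,116; once the byte total grows past the int
    // range, the (int) cast in get(docID) produces exactly this kind of
    // negative pointer.
    long pointer = 4_294_919_180L;
    int truncated = (int) pointer;   // == -48116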

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-11 Thread Trejkaz
On Thu, Jan 12, 2017 at 1:02 PM, Kumaran Ramasubramanian
 wrote:
> I always use filter when i need to add more than 1024 ( for no scoring
> cases ).  If filter is removed in lucene 6, what will happen to
> maxbooleanclauses limit? Am i missing anything?

That sounds like a BooleanQuery with FILTER clauses to me, a totally
different use case. I'm talking about using Filter as a base class to
implement queries where you already have the results as a set of some
sort without scoring.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Where did earthDiameter go?

2017-01-11 Thread Trejkaz
Hi.

I don't know why, but we have some kind of esoteric logic in our own
code to simplify a circle on the Earth to a bounding box, clearly
something to do with computing geo queries.

double lonMin = -180.0, lonMax = 180.0;
if (!closeToPole(latMin, latMax)) {
    double D = SloppyMath.earthDiameter(lat);
    double d = D * Math.sin((90.0 - lat) * Math.PI / 180.0); // diameter of a disk formed by parallel at latitude = lat
    double kmPerLonDeg = Math.PI * d / 360.0;
    double distanceInLonDeg = distanceKm / kmPerLonDeg;
    lonMin = lon - distanceInLonDeg;
    lonMax = lon + distanceInLonDeg;
}

This SloppyMath.earthDiameter(latitude) method appears to be gone in
v6.3.0 but I don't see any mention of a replacement in the changelog.
Is there a replacement? Do I just slot in a constant and hope that
nobody notices? I mean, if the maths are supposed to be "sloppy"... :D

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-11 Thread Trejkaz
On Thu, Jan 21, 2016 at 4:25 AM, Adrien Grand  wrote:
> Uwe, maybe we could promote ConstantScoreWeight to an experimental API and
> document how to build simple queries based on it?

In the future now, looking at Lucene 6.3 Javadocs, where Filter is now
gone, and it seems that ConstantScoreWeight is still @lucene.internal
(and awfully hard to understand how it can do much at all...). Did we
ever get a replacement class for this use case for Filter? I read
something about solr taking a copy of the class over in its code,
which might be what we have to do here, but I wanted to check first.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: CPU usage 100% during search

2017-01-02 Thread Trejkaz
On Tue, Jan 3, 2017 at 5:26 AM, Rajnish kamboj  wrote:
>
> Hi
>
> The CPU usage goes upto 100% during search.

Isn't that ideal? Or would you prefer your searches to be slow, blocked by I/O?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Email id tokenizer (actual email id & multiple terms)

2016-12-21 Thread Trejkaz
On Wed, Dec 21, 2016 at 11:23 PM, suriya prakash  wrote:
> Hi,
>
> Thanks for your reply.
>
> I might have one or more emailds in a single record.

Just so you know, you can add the same field more than once with the
field analysed by KeywordAnalyzer, and it will still become multiple
tokens. This is safer than something like WhitespaceAnalyzer, because
email addresses can actually contain spaces. (UAX29URLEmailAnalyzer
might do the right thing though.)
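
In other words, something like this (field name and addresses are placeholders;
StringField here has the same effect as a TextField run through KeywordAnalyzer):

    Document doc = new Document();
    for (String address : addresses) {
        // one un-split token per address, even if the address contains spaces
        doc.add(new StringField("to", address, Field.Store.NO));
    }
    // exact match against any one of the values:
    Query query = new TermQuery(new Term("to", "someone@example.com"));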

But if you're doing this in the main text content field,
TeeSinkTokenFilter does seem like the right thing to use. (I have
never found a use for it myself.)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Trejkaz
On Wed, Dec 21, 2016 at 1:21 AM, Ahmet Arslan  wrote:
> Hi,
>
> You can index whole address in a separate field.
> Otherwise, how would you handle positions of the split tokens?
>
> By the way, speed of phrase search may be just fine, so consider trying first.

Speed aside, phrase search is difficult because you'll accidentally
match too much.
(u...@company.com will match u...@company.com.au, j...@gmail.com will
match little.j...@gmail.com, etc.)

Using a separate field for non-tokenised addresses would be my
recommendation too.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Opposite of SpanFirstQuery - Searching for documents by last term in a field

2016-12-13 Thread Trejkaz
On Wed, Dec 12, 2012 at 3:04 AM, Ian Lea  wrote:
> The javadoc for SpanFirstQuery says it is a special case of
> SpanPositionRangeQuery so maybe you can use the latter directly,
> although you might need to know the position of the last term which
> might be a problem.
>
> Alternatives might include reversing the terms and using SpanFirst or
> adding a special "thisistheend" token to each field and using
> SpanNearQuery for dog and thisistheend with suitable value for slop
> and inOrder = true.
>
> Or take the last term and index it in a separate field so you can just
> search for lastterm: dog.

Idly wondering whether anyone has figured out a good way yet in the
time elapsed since last asked.

Here's my problems with the existing ideas:

1. (Using SpanPositionRangeQuery) I am not really sure how to get the
position of the last term.

2. (Using a special token) Adding a token to every document skews term
statistics and requires manually filtering it out of term listings.
Additionally it ruins certain wildcard queries like field:* since now
every field will match.

3. (Indexing the last term(s) in a separate field) In our case we
don't know how far from the end of the content the user will enter
into the query. They might write:

  term w/10 end-of-content
  term w/1000 end-of-content
  ...

Other ideas:

4. Storing all the content twice initially seems to be a potential
solution, but starts looking very hard once you combine queries. For
instance, what about this:

  (term w/10 start-of-content) w/30 (another-term w/10 end-of-content)

5. Put a payload the last term and then _somehow_ (I have no idea how
payload queries work yet) use payload queries to do spans from that.


Is there any good solution to this that people have already figured
out? Is there another SpanPositionCheckQuery subclass that could be
written which somehow fetches the last position in the document from
the acceptPosition method?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ReaderManager, more drama with things not being closed before closing the Directory

2016-10-19 Thread Trejkaz
Hi all.

I seem to have a situation where ReaderManager is reducing a refCount
to 0 before it actually releases all its references.

It's difficult because it's all mixed up in our framework for multiple
ReaderManagers, which I'm still not convinced works because the
concurrency is impossible to figure out, and which I probably won't be
allowed to publish in order to have anyone at Lucene look at it either. (Which
is why I hope that someone at Lucene figures out how to manage more
than one index reliably one day...)

The stack trace trying to close the directory is just trying to
refresh the reader, but I guess this reader was the last one using a
Directory, so now we're closing that as well:

java.lang.RuntimeException: Resources inside the directory did not
get closed before closing the directory
at 
com.acme.storage.textindex.store.CloseCheckingDirectory.close(CloseCheckingDirectory.java:109)
at 
com.acme.storage.textindex.index.DefaultIndexReaderSharer$IndexReaderWrapper.release(DefaultIndexReaderSharer.java:146)
at 
com.acme.storage.textindex.index.DefaultIndexReaderSharer$IndexReaderWrapper.access$100(DefaultIndexReaderSharer.java:77)
at 
com.acme.storage.textindex.index.DefaultIndexReaderSharer.release(DefaultIndexReaderSharer.java:45)
at 
com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager$1.doClose(DefaultTextIndex.java:370)
at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:253)
at 
com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager.decRef(DefaultTextIndex.java:331)
at 
com.acme.storage.textindex.DefaultTextIndex$WrappingReaderManager.decRef(DefaultTextIndex.java:306)
at 
org.apache.lucene.search.ReferenceManager.release(ReferenceManager.java:274)
at 
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:189)
at 
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)

The stack trace which opened the resource and didn't close it is
apparently the first reader which the ReaderManager opened:

Caused by: java.lang.RuntimeException: unclosed IndexInput: _7d.tvd
at 
com.acme.storage.textindex.store.CloseCheckingDirectory.addOpenResource(CloseCheckingDirectory.java:82)
at 
com.acme.storage.textindex.store.CloseCheckingDirectory.openInput(CloseCheckingDirectory.java:57)
at 
org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.(CompressingTermVectorsReader.java:144)
at 
org.apache.lucene.codecs.compressing.CompressingTermVectorsFormat.vectorsReader(CompressingTermVectorsFormat.java:91)
at 
org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:120)
at org.apache.lucene.index.SegmentReader.(SegmentReader.java:65)
at 
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:58)
at 
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:50)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
at 
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
at 
com.acme.storage.textindex.index.DefaultIndexReaderSharer$CustomReaderManager.(DefaultIndexReaderSharer.java:164)

But if it's the first reader held by the ReaderManager, I wouldn't
expect the refCount to be 0, so it shouldn't be closing the directory.

I can't reproduce this myself, so I can't just dump out conveniently
placed messages to figure out how it's happening...

But has anyone else seen something like this?

CustomReaderManager is probably shareable, it just does this:

private static class CustomReaderManager extends ReferenceManager<DirectoryReader> {
    private CustomReaderManager(Directory directory) throws IOException {
        current = UnInvertingDirectoryReader.wrap(DirectoryReader.open(directory));
    }

    @Override
    protected void decRef(DirectoryReader reference) throws IOException {
        reference.decRef();
    }

    @Override
    protected DirectoryReader refreshIfNeeded(DirectoryReader referenceToRefresh) throws IOException {
        return DirectoryReader.openIfChanged(referenceToRefresh);
    }

    @Override
    protected boolean tryIncRef(DirectoryReader reference) {
        return reference.tryIncRef();
    }

    @Override
    protected int getRefCount(DirectoryReader reference) {
        return reference.getRefCount();
    }
}

So basically the same as the normal one, except that it wraps the
reader in an UnInvertingDirectoryReader. The only reason we're forced
to subclass the manager to do this is that if we don't, each
UnInvertingDirectoryReader becomes a new instance, and basic caching
stuff stops working in some way.

DefaultIndexReaderSharer#release() is 

Re: What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Continuation, found a bug but I'm not sure whether it's in Lucene or
Lucene's Javadoc.

In MultiFields:

  @SuppressWarnings({"unchecked","rawtypes"})
  @Override
  public Iterator<String> iterator() {
    Iterator<String> subIterators[] = new Iterator[subs.length];
    for(int i=0;i<subs.length;i++) {
      subIterators[i] = subs[i].iterator();
    }
    return new MergedIterator<>(subIterators);
  }

MergedIterator says in the Javadoc:

"The behavior is undefined if the iterators are not actually sorted."

And indeed, the iterators are _not_ actually sorted. So I look at
where they come from, Fields#iterator(), which is documented fairly
tersely:

"Returns an iterator that will step through all fields names.
This will not return null."

Which doesn't say anything about the names being in order. So I assume
that either:

  (a) Fields#iterator() is actually supposed to be sorted and the
documentation should specify it but doesn't, or

  (b) Fields#iterator() is not supposed to be sorted, but either
MultiFields#iterator() or MergedIterator is supposed to be handling
this better.

Either way, I think it's a bug in Lucene. But since I don't know which
direction it's in, and I don't have a reproducible test case I can
just hand over, I can't easily file it. :/

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Additional investigation:

The index has two segments. Both segments have this "path-position" in
the FieldInfo only once. The settings look the same:

FieldInfo in first sub-reader:
name = "path-position"
number = 6
docValuesType = NONE
storeTermVector = false
omitNorms = true
indexOptions = DOCS_AND_FREQS_AND_POSITIONS
storePayloads = false
attributes =
"PerFieldPostingsFormat.format" -> "Lucene50"
"PerFieldPostingsFormat.suffix" -> "0"
dvGen = -1

FieldInfo in second sub-reader:
name = "path-position"
number = 6
docValuesType = NONE
storeTermVector = false
omitNorms = true
indexOptions = DOCS_AND_FREQS_AND_POSITIONS
storePayloads = false
attributes =
"PerFieldPostingsFormat.format" -> "Lucene50"
"PerFieldPostingsFormat.suffix" -> "0"
dvGen = -1

So I'm confused. addIndexes, I thought, merged the data from the given
readers into the destination writer. And here I have two fields with
the same name, number and every other setting, and somehow it's
failing to merge them because when it gets to the second one, it fails
because the first one existed already... which to me, seems like the
point of merging, but maybe that's just me.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Hi all.

Does anyone know what this error message means?

found existing value for PerFieldPostingsFormat.format,
field=path-position, old=Lucene50, new=Lucene50
java.lang.IllegalStateException: found existing value for
PerFieldPostingsFormat.format, field=path-position, old=Lucene50,
new=Lucene50
at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170)
at 
org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:193)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:95)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2629)

We're doing fancy migrations which perform index changes by overriding
FilterCodecReader and copying into a new index, but in this particular
case the migration is only *deleting* values from the index, so it
seems odd that I'd get this particular error.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Performance of Prefix, Wildcard and Regex queries?

2016-10-16 Thread Trejkaz
On Sat, Oct 15, 2016 at 1:21 AM, Rajnish Kamboj  wrote:
> Hi
>
> Performance of Prefix, Wildcard and Regex queries?
> Does Lucene internally optimizes this (using rewrite or something else) or
> I have to manually create specific queries depending on input pattern.
>
> Example
> if input is 78* create Prefix query
> if input is 87?98* create Wildcard query
> if input is 87[7-5]* create Regex query.

I think QueryParser already takes care of converting to PrefixQuery
when possible.

Regexes aren't really possible, though. Consider this:

abc* (wildcard query, matching abc followed by anything)

Versus this:

abc*  (regex query, matching ab followed by 0 or more c)

I think for that, you're going to want additional syntax in your query parser.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Query Parser Special Characters

2016-10-13 Thread Trejkaz
On Fri, Oct 14, 2016 at 2:47 AM, Ashley Ryan  wrote:
> Obviously, our work around of escaping the angle brackets works as we need
> it to, but it seems to me that your documentation is incorrect. Am I
> misunderstanding the documentation or conflating the issue I'm seeing with
> the topic of special characters?

Maybe the documentation you're reading is for the older QueryParser
and not StandardQueryParser? Neither StandardQueryParser and
StandardSyntaxParser appear to say anything about special
characters... or indeed very much about the syntax at all, which is a
bit of a gap.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: complex disjoint search query

2016-10-12 Thread Trejkaz
On Thu, Oct 13, 2016 at 5:04 AM, Mikhail Khludnev  wrote:
> Hello,
> Why not "To:local.one -(To:[* TO local.one} To:{local.one TO *)" ?

That would not match example 2:

> 2. To:other.one, third.one,

This alone would match 1 and 2, but not 3:

To:[* TO local.one} OR To:{local.one TO *)

Except the performance would be rubbish if you had a large enough
number of domains.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Default LRUQueryCache causing OOO exception

2016-10-12 Thread Trejkaz
On Thu, Oct 13, 2016 at 6:32 AM, Michael McCandless
 wrote:
> You must be calling SearcherManager.maybeRefresh periodically, which
> does open new NRT readers.
>
> Can you please triple check that you do in fact always release() after
> an acquire(), in a finally clause?

For what it's worth, I found this API particularly hard to use.

1. I would prefer a withReader(Callback) kind of method instead of
separate acquire and release methods. It makes it impossible to forget
to call the release method, and now that lambdas are in the language, it
looks a hell of a lot tidier than try-finally. (A rough sketch follows
after this list.)

2. If there has to be some kind of cleanup I'm supposed to perform on
an object, I prefer that to be done in close() so that I can use
try-with-resources like with any other object that I'm expected to
close.

  2b. The API design is doubly bad, because the object it returns
*does* have a close() method, but "no, you're not allowed to call
that, you have to use this other method over here which almost every
developer on the team will get wrong every single time".

3. I wish there had been a version which could keep track of the same
stuff for multiple indexes, since getting that to work reliably has
been nearly impossible. (I think we're there right now, but I have no
way to prove it!)
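
For point 1, the wrapper is small enough to roll yourself over SearcherManager
(a sketch; the IOFunction interface is made up here only because checked
exceptions don't mix with the built-in functional interfaces):

    static <T> T withSearcher(SearcherManager manager, IOFunction<IndexSearcher, T> body) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return body.apply(searcher);
        } finally {
            manager.release(searcher);   // can no longer be forgotten by callers
        }
    }

    interface IOFunction<A, R> {
        R apply(A input) throws IOException;
    }

Callers then just write withSearcher(manager, s -> s.search(query, 10)).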

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)

2016-10-06 Thread Trejkaz
Thought I would try some thread necromancy here, because nobody
replied about this a year ago.

Now we're on 5.4.1 and the numbers changed a bit again. Recording best
times for each operation.

Indexing: 5.723 s
SpanQuery: 25.13 s
MultiPhraseQuery: (waited 10 minutes and it hasn't completed yet)
TermAutomatonQuery: 19.72 s

So it seems like span query performance is slightly better than it was
in 5.2, but MultiPhraseQuery is still no good, and TermAutomatonQuery
might be better than both.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What version is this index?

2016-09-19 Thread Trejkaz
On Mon, Sep 19, 2016 at 3:41 PM, Trejkaz <trej...@trypticon.org> wrote:
> The version checking code then says that because format < 9 and format >= 11,
> the index must be Lucene 3.0.

Obviously I meant format < -9 and format >= -11. Just in case this
confuses anyone.

Also as it turns out, the same version detection code run on
segments_1 decides that segments_1 was created by Lucene 3.x.

Also, if I try experimentally deleting one of the segments files and
upgrading the index anyway, it doesn't matter which one I delete, I
end up with a working index. Albeit two different results...

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



What version is this index?

2016-09-18 Thread Trejkaz
Hi all.

I have an index in my hands where we have:

  1197474657 _0.fdt
  270297 _0.fdx
7737 _0.fnm
 520 _0.si
   377812472 _0.tvd
  216765 _0.tvx
   182245906 _0_Lucene50_0.doc
  4121910583 _0_Lucene50_0.pos
   197539330 _0_Lucene50_0.tim
 2329869 _0_Lucene50_0.tip
  358614 _0_Lucene54_0.dvd
 124 _0_Lucene54_0.dvm
  2860857474 _d.fdt
 1147324 _d.fdx
 996 _d.fnm
   282663967 _d.frq
  4350938982 _d.prx
  130478 _d.tii
   165854648 _d.tis
 2261082 _d.tvd
   656372080 _d.tvf
 2294644 _d.tvx
  20 segments.gen
 136 segments_1
 270 segments_2

When we open the index to try and guess the current version, we only read
the current commit, so segments_2.
We read the first int from this file, and it is -11.

The version checking code then says that because format < 9 and format >=
11, the index must be Lucene 3.0.

But then there are files like _0_Lucene50_0.pos in this index which clearly
*can't* be from Lucene 3. And lo and behold, when we open the index using
Lucene 4's IndexUpgrader, it fails, saying that the index format is too new.

So is this normal? Is it legit to have a segments file supposedly created
by v3, even though all the files in the index appear to be created by v5?
Should our version guesser also be opening all the individual files and
checking something in there?

Is the presence of multiple segments_N files somehow related?

Here's a dump of the segments files:

segments_2:

 fff5  0155 c5b7 7ae3  000e
 0001 0533 2e36 2e32 025f 6400 0230
37ff      ff01 
 ff00  0001  0009 026f 7309
5769 6e64 6f77 7320 380b 6a61 7661 2e76
656e 646f 7212 4f72 6163 6c65 2043 6f72
706f 7261 7469 6f6e 0c6a 6176 612e 7665
7273 696f 6e08 312e 382e 305f 3035 0e6c
7563 656e 652e 7665 7273 696f 6e24 332e
362e 322d 534e 4150 5348 4f54 202d 2032
3031 342d 3031 2d31 3620 3136 3a31 343a
3134 136d 6572 6765 4d61 784e 756d 5365
676d 656e 7473 0131 076f 732e 6172 6368
0561 6d64 3634 0673 6f75 7263 6505 6d65
7267 650b 6d65 7267 6546 6163 746f 7202
3133 0a6f 732e 7665 7273 696f 6e03 362e
3201     06a4 863c

segments_1:

3fd7 6c17 0873 6567 6d65 6e74 7300 
06ea 9f7c fe4d baa3 64c9 9e10 af2d 052c
5201 3105 0401    0004 
0001  0001 0504 0102 5f30 01ea 9f7c
fe4d baa3 64c9 9e10 af2d 052c 5108 4c75
6365 6e65 3534     
       
    c028 93e8  
  809b 4eea

segments.gen:  (appears to indicate that the current commit is segments_2)

 fffe    0002  
 0002




TX


Re: MultiFields#getTerms docs clarification

2016-08-30 Thread Trejkaz
On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless
 wrote:
> Seems like you need to scrutinize exactly what documents were indexed in step 
> 3?
>
> How exactly did you copy documents out of the old index?  Note that
> when Lucene's IndexReader returns a Document, it's not the same
> Document that was indexed in the first place: it will only have fields
> that were stored, and it does not store certain metadata about how
> those field values were indexed.  But I don't see how that alone can
> lead to indexing an empty string token.

The root cause is that, apparently, in some older version, we *did*
index an empty field, which at some point later had already been fixed
by someone else. I verified that this empty field was in fact present
in the stored fields for the document before the index was migrated to
Lucene 5.

So the only obvious difference then is between Lucene 3 indexing no
tokens for this field, and Lucene 5 indexing a single empty token?

I have ended up putting in a migration to delete the spurious empty
term in the postings as well as deleting the empty field from all the
documents where it's present.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MultiFields#getTerms docs clarification

2016-08-28 Thread Trejkaz
Updating this with newly-obtained info.

1. The original index was created in Lucene 3.x. In 3.x, if I call
getMin(), it returns non-empty values. So far so good.

2. The index then gets migrated to 5.x using multiple IndexUpgrader
steps. Now, when I call getMin(), it still returns a non-empty value.

3. At some point, the user performs an operation where we copy
documents out of the current index into a new index. When we get the
Document, it has the field in question, even though no value was set
into the field. This then gets indexed, and when the destination index
is finally opened, getMin() returns an empty string.

Something doesn't quite add up though.

Surely if we had put an empty string into a field back in 3.x, it
would have indexed it, and then getMin() would have always returned
the empty string, but that isn't what we're seeing at all. Even after
upgrading the index to the 5.x format, getMin() still returns the
lowest real value. Therefore, it seems reasonable to assume that we
weren't putting the empty field into the document. But if we didn't
put it into the document, why is the field now coming back in Lucene
5.x?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Unknown type flag: 6 at CompressingStoredFieldsReader.readField

2016-08-23 Thread Trejkaz
Hi all.

Someone apparently got this assertion failure on one of their indexes
which they had been storing on a network drive for some stupid reason:

AssertionError: Unknown type flag: 6 at
org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.readField(CompressingStoredFieldsReader.java:237)

The valid values:

static final int STRING = 0x00;
static final int   BYTE_ARR = 0x01;
static final intNUMERIC_INT = 0x02;
static final int  NUMERIC_FLOAT = 0x03;
static final int   NUMERIC_LONG = 0x04;
static final int NUMERIC_DOUBLE = 0x05;

It seems a bit too coincidental that a 0x06 has appeared, but could
this just be a 1 in 256 chance? Or is there some way 0x06 could have
ended up there?

As usual, no reproduction has been given to us.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: docid is just a signed int32

2016-08-18 Thread Trejkaz
On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand  wrote:
> No, IndexWriter enforces that the number of documents cannot go over
> IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
> BaseCompositeReader computes the number of documents in a long variable and
> ensures it is less than 2^31, so you cannot have indexes that contain more
> than 2^31 documents.
>
> Larger collections should be written to multiple shards and use
> TopDocs.merge to merge results.

But hang on:
* TopDocs#merge still returns a TopDocs.
* TopDocs still uses an array of ScoreDoc.
* ScoreDoc still uses an int doc ID.

Looks like you're still screwed.

I wish IndexReader would use long IDs too, because one IndexReader can
be across multiple shards too - it doesn't make much sense to me that
this is restricted, although "it's hard to fix in a
backwards-compatible way" is certainly a good reason. :D

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MultiFields#getTerms docs clarification

2016-08-13 Thread Trejkaz
On Fri, Aug 12, 2016 at 11:51 PM, Michael McCandless
 wrote:
> Getting an empty BytesRef back from Terms.getMin() means Lucene thinks you
> indexed an empty (zero length) token.  Lucene (unfortunately) allows this.
> Is it possible you did that?
>
> If not, can you make a test case showing this?

I have no idea how they got it to happen either. I'm hoping a
reproduction will be provided so that I can figure out how it could
have happened and then hopefully be able to make a test case.

The field is some kind of numeric field, so it *should* have either
been absent or contained a number. But maybe there was some past bug
where it really was an empty string.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



MultiFields#getTerms docs clarification

2016-08-12 Thread Trejkaz
Hi all.

The docs on MultiFields#getTerms state:

> This method may return null if the field does not exist.

Does this mean:

  (a) The method *will* return null if the field does not exist.

  (b) The method will *not necessarily* return null if the field does not exist.

I think we've seen a situation where it somehow returned non-null, but
them Terms#getMin() returned an empty BytesRef, as if we had asked for
an absent value. I would expect getMin() not to count absent values as
the minimum, only because if that were the case, I would have
reproduced the same error during development.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any compatiblity issue in the upgrade from Lucene Core 3.2.0 to Core 6.1.0?

2016-08-08 Thread Trejkaz
On Tue, Aug 9, 2016 at 12:36 PM, 郑文兴  wrote:
> Then it sounds like that "re-index all the sources in 6.x" is the most 
> feasible way, :(.

If you can, that's what I would do. There are newer features you'll
want to use anyway and migrating in doc values and the like is not the
easiest thing in the world.

In our case, we had no source material to reindex from, so we had to
do it all the hard way.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any compatiblity issue in the upgrade from Lucene Core 3.2.0 to Core 6.1.0?

2016-08-08 Thread Trejkaz
On Mon, Aug 8, 2016 at 1:37 PM, Erick Erickson  wrote:
> Yes. Lucene only guarantees back-compatibility with
> indexes for one major version. That is, a 4.x release can
> read a 3.x Lucene index. But a 5.x will not read a 3.x.
>
> So you have some options here:
> 1> re-index all your source in 6.x. This is probably easiest
> 2> upgrade in stages, check out the IndexUpgradeTool.
> Here's the doc for 4.x, which should the 3.x version.
>
> I'll leave it to the people who know Lucene better than me
> to opine about whether running the IndexUpgrader several
> times in succession works well.

One problem is that IndexUpgrader and the rest of Lucene have the same
package names in all the subsequent versions, so there is a fiddly bit
of renaming the packages in order to get them all into one classpath.
Actually figuring out how to call the upgrader each time was also a
bit interesting, given that by renaming the packages, the Directory
class had to be adapted in interesting ways to eventually call through
to the current Lucene's version of the Directory class. I think the
InfoStream had some interesting adapting challenges as well.
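
Roughly, the shape of the staged upgrade ends up like this. The package
prefixes here are hypothetical shaded names, not anything that exists
in a stock Lucene, and indexPath is whatever directory you are
upgrading:

    // stage 1: 3.x -> 4.x format, using a shaded copy of Lucene 4
    try (shaded.lucene4.store.Directory dir =
             shaded.lucene4.store.FSDirectory.open(new java.io.File(indexPath))) {
        new shaded.lucene4.index.IndexUpgrader(dir).upgrade();
    }
    // stage 2: 4.x -> 5.x format, using a shaded copy of Lucene 5
    try (shaded.lucene5.store.Directory dir =
             shaded.lucene5.store.FSDirectory.open(java.nio.file.Paths.get(indexPath))) {
        new shaded.lucene5.index.IndexUpgrader(dir).upgrade();
    }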

But even then... I'm not sure IndexUpgrader is 100% flawless. We had
one issue initially where it wasn't upgrading empty indexes at all,
which I think is fixed now. More recently I think we have hit a case
where a failure occurred due to the index being corrupt, but the
upgrade method returned as if it was completely successful... almost
like it was doing the merges in the background, but not waiting until
they complete.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dubious error message?

2016-08-05 Thread Trejkaz
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson  wrote:
> Question 2: Not that I know of
>
> Question 2.1. It's actually pretty difficult to understand why a single _term_
> can be over 32K and still make sense. This is not to say that a
> single _text_ field can't be over 32K, each term within that field
> is (usually) much less than that.
>
> Do you have a real-world use-case where you have a 115K term
> that can _only_ be matched by searching for exactly that
> sequence of 115K characters? Not substrings. Not wildcards. A
> "string" type (as opposed to anything based on solr.Textfield).

This particular field is used to store unique addresses, and for
precision reasons we wanted to search for addresses without tokenising
them: if you tokenised them, b...@example.com could accidentally
match b...@example.com.au, even though they're two different people. It
also makes statistics faster to calculate.

Now, addresses in SMTP email are fairly short, limited to something
like 254 characters, but sometimes you get data that violates the
standard, and we store more than just that one kind of address, and
maybe one of the other sorts can be longer.

In this situation, it isn't clear whether you can truncate the data,
because if you truncate it, now two addresses are considered equal
when they're not the same string. But then again, if the old version
of Lucene was already truncating it, people might be fine with it
being truncated in the new version. But if they didn't know that,
there would definitely be someone who objects.
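
If we did end up capping values ourselves rather than letting Lucene
throw, the check is at least cheap to do up front. A sketch of what I
mean - the hashing policy here is just an idea, not something we
actually ship, and it assumes IndexWriter.MAX_TERM_LENGTH is the right
constant to compare against:

    static String capTerm(String value) {
        byte[] utf8 = value.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        if (utf8.length <= IndexWriter.MAX_TERM_LENGTH) {
            return value;   // fits; index it untouched
        }
        // keep a recognisable prefix plus a hash so two distinct overlong values stay distinct
        return value.substring(0, 200) + "#" + Integer.toHexString(java.util.Arrays.hashCode(utf8));
    }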

So I'm not really saying that the term "makes sense" - I'm just saying
we encountered it in real-world data, and an error occurred. Someone
then complained about the error.

> As far as the error message is concerned, that does seem somewhat opaque.
> Care to raise a JIRA on it (and, if you're really ambitious attach a patch)?

I'll see. :)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Dubious error message?

2016-08-04 Thread Trejkaz
Trying to add a document, someone saw:

java.lang.IllegalArgumentException: Document contains at least one
immense term in field="bcc-address" (whose UTF8 encoding is longer
than the max length 32766), all of which were skipped.  Please correct
the analyzer to not produce such terms.  The prefix of the first
immense term is: '[00, --omitted--]...', original message: bytes can
be at most 32766 in length; got 115597

Question 1: It says the bytes are being skipped, but to me "skipped"
means it's just going to continue, yet I get this exception. Is that
intentional?

Question 2: Can we turn this check off?

Question 2.1: Why limit in the first place? Every time I have ever
seen someone introduce a limit, it has only been a matter of time
until someone hits it, no matter how improbable it seemed when it was
put in.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Exception in the logs from IndexUpgrader (ArrayIndexOutOfBoundsException from FixedBitSet.set)

2016-08-02 Thread Trejkaz
Hi all.

Someone saw IndexUpgrader from 4.10.4 throw this when upgrading their index:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 191
    at org.apache.lucene.util.FixedBitSet.set(FixedBitSet.java:252)
    at org.apache.lucene.codecs.PostingsConsumer.merge(PostingsConsumer.java:113)
    at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:164)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:399)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:112)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4196)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3784)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

Two questions:

  (1) Does this just mean the index is corrupt?

  (2) Shouldn't an exception during merging cause IndexUpgrader to fail?
We only noticed this issue because of an exception from the Lucene _5_
IndexUpgrader on the resulting index: it still found V3 segments in the
index, even though we run the V4 upgrader first.
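
One experiment that might rule the background-merge theory in or out:
pass our own config with a serial merge scheduler, so any merge failure
has to surface on the calling thread. A sketch against 4.10, assuming
dir is the index Directory - I haven't verified that this actually
changes the behaviour:

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_4, null);  // IndexUpgrader itself passes a null analyzer
    iwc.setMergeScheduler(new SerialMergeScheduler());  // merges run inline, so failures should propagate
    new IndexUpgrader(dir, iwc, false).upgrade();        // false = keep prior commits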

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to get the index for a document after a search over multiple indexes

2016-06-14 Thread Trejkaz
On Wed, Jun 15, 2016 at 6:08 AM, Mark Shapiro  wrote:
> private static IndexSearcher getSearcher( String[] indexDirs ) throws 
> Exception {
>   IndexReader[] readers = new IndexReader[indexDirs.length];
>   FSDirectory[] directorys = new FSDirectory[indexDirs.length];
>
>   for ( int i = 0; i < indexDirs.length; ++i ) {
> File file = new File( indexDirs[i] );
> directorys[i] = ( bWindows ) ? new MMapDirectory( file ) : new 
> NIOFSDirectory( file );
> directorys[i].setReadChunkSize( Integer.MAX_VALUE );
> readers[i] = IndexReader.open( directorys[i], true );
>   }
>
>   MultiReader multiReader = new MultiReader( readers, true );
>   IndexSearcher searcher = new IndexSearcher( multiReader, executorService ); 
>  // ExecutorService
>   return searcher;
> }

OK, so if you want to retrieve the Document, you can call
document(int) on this multiReader and it will automatically take care
of retrieving it from the right underlying reader.

If you want to do it manually, you can keep the IndexReader[] around
and binary search over it to figure out which one contains the doc ID.
Or you can build an int[] of the starting doc IDs for each reader and
use ReaderUtil.subIndex to compute it, which is probably more
efficient. This is what composite readers do internally for the same lookup.
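
Roughly, as a sketch against the 3.x API (worth double-checking the
exact ReaderUtil signature in your version):

    // build the starting doc ID of each sub-reader once, after opening them
    int[] starts = new int[readers.length];
    int maxDoc = 0;
    for (int i = 0; i < readers.length; i++) {
        starts[i] = maxDoc;
        maxDoc += readers[i].maxDoc();
    }

    // later, for a hit from the MultiReader:
    int readerIndex = ReaderUtil.subIndex(scoreDoc.doc, starts);  // which of your indexes it lives in
    int localDoc = scoreDoc.doc - starts[readerIndex];            // doc ID within that index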

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to get the index for a document after a search over multiple indexes

2016-06-13 Thread Trejkaz
On Tue, Jun 14, 2016 at 9:01 AM, Mark Shapiro  wrote:
> How can I find the single index associated with each Document returned by a 
> search over
> multiple indexes?  The document number is not enough, I want to save the 
> index also so
> that later I can retrieve the file contents that was stored in the index.  
> This question applies
> to Lucene 3.5.0.

How did you do the search over multiple indexes in the first place?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



LRUQueryCache appears to be messing with me

2016-03-02 Thread Trejkaz
Hi all.

I spent a while trying to track down some weird behaviour where our
custom queries that work off information outside Lucene were returning
out of date information.

It looks like what's happening is that LRUQueryCache compares the
queries, decides that they're the same (equals() does return true) and
returns out of date info.

Is there a recommended way to deal with this? I can think of a few
possibilities.

  * Stub out the caching with a null implementation of QueryCache.
(Current preference. Would slow things down but practically none of
our "filters" are of the type which this thing could correctly cache
without changes.)

  * Try to implement our own QueryCachingPolicy which looks at the
Query object and somehow figures out whether a given query is still
valid. (Probably makes more sense, but doesn't fit very well with our
current model where we programmatically invalidate queries.)

  * Try to implement QueryCache directly on our existing query cache
class. (Seems to make sense too, but difficult because ours takes
Query and Lucene's takes Weight.)

  * Try to change the Query objects such that two queries created from
the same query text end up not being equal(). For instance, keep track
of the last change anyone makes to a database and pull that every time
a query is parsed. Queries with the same time will share data and ones
created after a change occurred would not use the cached data.

Alternatively, maybe there is a supported way to do filters against
external stores by now? It has been quite some time since the last
time someone said "don't do that", so I thought I'd check that too.
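
For the record, the first option looks like a one-liner per searcher -
a sketch, and I believe there is also a static
IndexSearcher.setDefaultQueryCache if you want it everywhere:

    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setQueryCache(null);   // null disables query caching for this searcher entirely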

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-25 Thread Trejkaz
So it turns out I still have problems.

I wanted to return a proxy reader that the caller could close like
normal. I wanted to do this for two reasons:

1. This:

try (IndexReader reader = sharer.acquireReader(...)) {
...
}

Looks much nicer than this:

IndexReader reader = sharer.acquireReader(...);
try {
...
} finally {
sharer.releaseReader(reader);
}

It isn't entirely clear from ReaderManager's docs whether you
*have* to call release or whether calling close() on the reader is
acceptable. Maybe it's fine to call close(), which removes some (but
not all) wrapping we are doing.

2. There have been bugs in the past where somehow readers got closed
more than once. I was hoping to stomp these out by giving out a
different reader to each caller so that we can track which ones have
already been closed and reject the second attempt. And indeed, since
implementing this, we haven't seen that sort of issue occur.

But supposedly the performance isn't good enough. The main cost
appears to be that creating what I thought should be a lightweight
wrapper is surprisingly expensive, because Lucene maintains a map
of which readers reference which other readers, and that uses
System.identityHashCode, which is apparently an expensive call. (We
always see it inside that call, not inside the other methods in
WeakHashMap. I guess native calls are just expensive?)

java.lang.Thread.State: RUNNABLE
    at java.lang.System.identityHashCode(Native Method)
    at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
    at java.util.WeakHashMap.hash(WeakHashMap.java:298)
    at java.util.WeakHashMap.put(WeakHashMap.java:449)
    at java.util.Collections$SetFromMap.add(Collections.java:5461)
    at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
    - locked <0x84067fe8> (a java.util.Collections$SynchronizedSet)
    at org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
    at org.apache.lucene.index.BaseCompositeReader.<init>(BaseCompositeReader.java:77)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:310)
    at org.apache.lucene.index.FilterDirectoryReader.<init>(FilterDirectoryReader.java:83)
    at IndexReaderSharer$CloseForwardingDirectoryReader.<init>(IndexReaderSharer:184)

So I'm back to another problem where now I want to reduce the number
of times I create this wrapping reader, yet I don't want to give the
same wrapping reader object out to more than one caller because one of
them could close it twice. I guess I could add yet another layer of
wrapping, make my own IndexReader interface and implement a proxy to
that which is actually cheap, but again it's going to involve writing
hard reference counting code so I'm wondering if there is another way
to avoid this whole mess.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Are "position" and "position increment" actually the exact same concept?

2016-02-14 Thread Trejkaz
On Tue, Feb 9, 2016 at 2:39 AM, András Péteri
 wrote:
> It's only the naming of FieldQueryNode's property that seems ambiguous to
> me. The caller of setPositionIncrement(int), AnalyzerQueryNodeProcessor
> [1], computes absolute term positions and stores that value in the
> property, not the increments. If the term attribute is not available, it'll
> increment the position by 1 for each term/groups of terms.
>
> [1]
> https://github.com/apache/lucene-solr/blob/master/lucene/queryparser/src/java/org/apache/lucene/queryparser/flexible/standard/processors/AnalyzerQueryNodeProcessor.java

Ah, that would explain my issue then. I had a custom builder which was
taking it as the position increment and computing the position and had
wondered why I was getting "sliding" positions only for our own custom
query nodes while somehow the normal phrase queries were working. :)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-09 Thread Trejkaz
On Wed, Feb 10, 2016 at 3:17 AM, Michael McCandless
 wrote:
> Why do you need to close the Directory?  It should be light weight.
> But if you really do need it, can't you subclass ReaderManager and
> override afterClose to close the directory?

I guess that's the next thing I'll try out. I am already subclassing
ReaderManager to wrap the readers after discovering that reader caches
don't work if I wrap from outside it.
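
For anyone following along, the override being suggested looks roughly
like this - a sketch, and I haven't confirmed yet how it interacts with
the wrapping we already do:

    ReaderManager manager = new ReaderManager(directory) {
        @Override
        protected void afterClose() throws IOException {
            directory.close();   // only close the Directory once the manager itself is closed
        }
    };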

As for the need to close it, it's true, in production we don't really
need to. But we happen to be using a Directory implementation that
checks that callers close things, and that check is only triggered
when we close the Directory at the moment. So it's just something that
adds diagnostics to help resolve other warnings about files not being
closed.

> So you essentially need to "lazy close" your ReaderManager, when there
> are no searches currently needing it?

That's more or less the right way to think about it. Actually each
session might make more than one search using the same reader and
reuse the same one for those, but when no sessions are running for
that index anymore I wanted to close it because Windows has annoying
file locking for read operations. If it weren't for Windows...

> Why not have a sync'd block, with a reference to the ReaderManager.
> Inside that block, if the reference is null, that means it's closed,
> and you open a new one.  Else, use the existing one.  Won't that do
> what you need w/o requiring full fledged reference counts?  Yes, it is
> a sync'd block around acquire/release, but I don't see how that can be
> avoided, and it'd be fast when the ReaderManager is already opened.

This is roughly what I currently have. At the moment I lock my entire
acquire() / release() methods, which maybe it's possible to reduce,
although I'm not entirely sure.

Concurrency is fiddly...

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-08 Thread Trejkaz
On Tue, Feb 9, 2016 at 2:10 AM, Sanne Grinovero
 wrote:
> Hi,
> you should really try to reuse the same opened Directory, like you
> suggest without closing it until your application is "done" with it in
> all its threads (normally on application shutdown).
> Keeping a Directory open will not lead to have open files, that is
> probably caused by not closing the instances of IndexReader.
>
> I'd highly recommend to use the ReaderManager for these reasons,
> especially because handling these details across different threads
> both correctly and efficiently can be tricky - I've learned that
> myself when implementing similar things before the ReaderManager was
> created.

I'm already using ReaderManager, but there are issues.

I want to close it when the last acquired index has been released, but
no thread knows anything about what indexes other threads could be
using, yet we still want indexes to be closed once nobody is using
them. So I end up having to reference count the ReaderManager, which
seems to defeat the purpose of using it in the first place since I
could just reference count the reader itself. I wish it could handle
automatically closing and reopening the index by itself, but I don't
think it can.

At the moment I have bolted this additional level of reference
counting around ReaderManager and it just creates a new ReaderManager
when the reference count goes back up to 1 and closes it when it goes
back to 0. But this blob has to be synchronised to implement it safely
and the map for looking these things up can never clean out entries,
because I couldn't find a safe way to do that even using
ConcurrentMap.
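
To make the shape of that blob concrete, it is roughly the following -
a sketch with invented names, and it still doesn't solve the
stale-map-entry problem:

    class SharedIndex {
        private final Directory directory;
        private ReaderManager manager;
        private int refCount;

        SharedIndex(Directory directory) { this.directory = directory; }

        synchronized ReaderManager acquireManager() throws IOException {
            if (refCount++ == 0) {
                manager = new ReaderManager(directory);   // first caller opens it
            }
            return manager;
        }

        synchronized void releaseManager() throws IOException {
            if (--refCount == 0) {
                manager.close();                          // last caller closes it
                manager = null;
            }
        }
    }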

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Is there a way to share IndexReader data sensibly across independent callers?

2016-02-04 Thread Trejkaz
Hi all.

Suppose 100 independent callers are opening the same index like this:

try (Directory directory = FSDirectory.open(path1);
 IndexReader reader = DirectoryReader.open(directory))
{
// keeps reader open for a long time
}

Someone complains that we're using a lot of memory, because each
IndexReader has its own cache.
We don't have any context that would tell us about other threads in
the same JVM that might have opened the same reader, other than it
ultimately going through our own code in both cases.

Is there a proper way to share the underlying IndexReader data? I
don't particularly care whether it's the same IndexReader but maybe it
would help if it's at least the same SegmentReaders...

Our existing attempts to use ReaderManager are not really working
because there are issues with figuring out when it's okay to close a
Directory. It seems like the answer is "never", because the caller who
passes in the first one might still be using it. But if you never
close it (which does seem to be the right way), Windows ruins your
plans by refusing to let you delete the files.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Are "position" and "position increment" actually the exact same concept?

2016-02-01 Thread Trejkaz
I found the following code in PhraseQueryNodeBuilder:

PhraseQuery.Builder builder = new PhraseQuery.Builder();
List<QueryNode> children = phraseNode.getChildren();
if (children != null) {
    for (QueryNode child : children) {
        TermQuery termQuery = (TermQuery) child
            .getTag(QueryTreeBuilder.QUERY_TREE_BUILDER_TAGID);
        FieldQueryNode termNode = (FieldQueryNode) child;

        builder.add(termQuery.getTerm(), termNode.getPositionIncrement());
    }
}

Note that:
* termNode.getPositionIncrement() returns a "position increment".
* PhraseQuery.Builder.add(Term,int) takes a "position".

I thought that "position" and "position increment" were two different
things, so I'm confused. Are they actually the same after all?
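
My working understanding of the difference, reduced to arithmetic, so
someone can tell me if this mental model is the thing that's wrong:

    // increments of 1, 1, 2 (a gap where, say, a stopword was removed) give positions 0, 1, 3
    int position = -1;
    for (int increment : new int[] {1, 1, 2}) {
        position += increment;           // position is the running sum of increments
        System.out.println(position);    // prints 0, 1, 3
    }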

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Determine if Merge is triggered in SOLR

2016-01-31 Thread Trejkaz
On Mon, Feb 1, 2016 at 5:59 AM, abhi Abhishek  wrote:
> Hi All,
> any suggestions/ ideas?

Start by not cross-posting to irrelevant mailing lists.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



SlopQueryNodeBuilder is wrecking the query the child node generated now?

2016-01-17 Thread Trejkaz
Hi all.

We have a custom QueryNode in our parser which creates a subclass of
PhraseQuery.

I find since updating to 5.3.1, SlopQueryNodeBuilder is replacing it
with a fresh PhraseQuery. Previously, it used to just set the slop on
the existing one, which allowed our custom subclasses straight
through.

The code in there looks a bit esoteric because it appears to build a
second instance of exactly the same query. One might think that it's
trying to clone it to avoid sharing a mutable Query object between two
places, but then in the other branch of the if-else it's letting
MultiPhraseQuery through unharmed, so I'm not really sure what this
change was supposed to achieve.

Needless to say we'll be cloning the old version into our own repo and
replacing it with that.

I'm not sure whether this is considered a bug, but perhaps a good
practice is that if you're going to create a brand new instance, you
check whether the thing you got passed is of that exact class first,
to avoid impacting people who are subclassing your classes.
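
i.e. something along these lines in the builder - a sketch of the guard
I mean, not the actual Lucene code:

    static boolean safeToRebuild(Query query) {
        // only rebuild when it is exactly a PhraseQuery; subclasses (like ours) pass through untouched
        return query.getClass() == PhraseQuery.class;
    }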

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Fwd: Replacement for Filter-as-abstract-class in Lucene 5.4?

2016-01-14 Thread Trejkaz
Hi all.

Filter is now deprecated, which I already knew was in the pipeline.

The docs say:

   "Use Query objects instead: when queries are wrapped in a
ConstantScoreQuery or in a BooleanClause.Occur.FILTER clause,
they automatically disable the score computation so the Filter
class does not provide benefits compared to queries anymore."

That's fair enough and an easy change to do on the caller side.
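
For reference, I take that to mean the caller side becomes something
like this (a sketch against the 5.4 API, where mainQuery and
restriction are whatever you were combining before):

    // the FILTER clause restricts matches but contributes nothing to the score
    Query filtered = new BooleanQuery.Builder()
        .add(mainQuery, BooleanClause.Occur.MUST)
        .add(restriction, BooleanClause.Occur.FILTER)
        .build();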

The other thing we are using Filter for is the other thing it mentions
in the Javadoc:

   "Convenient base class for building queries that only perform
matching, but no scoring. The scorer produced by such queries
always returns 0 as score."

What is the new convenient way to implement your own queries that
don't do scoring?

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene not closing an IndexOutput when writing?

2015-12-09 Thread Trejkaz
On Wed, Dec 9, 2015 at 10:26 PM, Michael McCandless
 wrote:
> That said, Lucene tries hard to close this file handle, e.g. if an
> in-memory segment is aborted because of e.g. an interrupt exception at
> a bad time.
>
> So, yes, please try to make a test showing that we failed to close it!
>  That's a bad bug if so ... we should never leak file handles.

It looks like things were actually OK. All actual files were being
closed. Phew! :)

What was going on was that when constructing the IndexInput we would
construct the close tracking object before opening the file. It turns
out that sometimes Lucene was passing in a name of a file that doesn't
exist, so in that situation, it got an exception and the close tracker
itself wasn't being closed. So it was raising inaccurate alerts about
things being open that were never opened.

I was tipped off that something like this was going on once I noticed
that the file it was complaining about didn't actually exist...
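
For completeness, the fix on our side was just to register the tracker
after the underlying open succeeds - roughly the following, with our
own (made-up here) class names:

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        IndexInput delegate = super.openInput(name, context);   // may throw if the file doesn't exist
        CloseTrackingIndexInput tracked = new CloseTrackingIndexInput(delegate, name);
        addOpenResource(tracked);   // only track once we really hold a handle
        return tracked;
    }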

Daniel

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene not closing an IndexOutput when writing?

2015-12-08 Thread Trejkaz
We have a Directory implementation that keeps track of who doesn't
close their IndexInput and IndexOutput.

In some test which is attempting to index documents and ultimately
timed out for other reasons (presumably triggering an interrupt,
admittedly not the sort of thing libraries are usually very good at
handling), it records that the following stack opened an IndexOutput
but didn't close it:

java.lang.RuntimeException: unclosed IndexOutput: _0.tvx
    at com.acme.storage.textindex.store.CloseCheckingDirectory.addOpenResource(CloseCheckingDirectory.java:82)
    at com.acme.storage.textindex.store.CloseCheckingDirectory.createOutput(CloseCheckingDirectory.java:68)
    at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
    at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.<init>(CompressingTermVectorsWriter.java:224)
    at org.apache.lucene.codecs.compressing.CompressingTermVectorsFormat.vectorsWriter(CompressingTermVectorsFormat.java:98)
    at org.apache.lucene.index.TermVectorsConsumer.initTermVectorsWriter(TermVectorsConsumer.java:88)
    at org.apache.lucene.index.TermVectorsConsumer.finishDocument(TermVectorsConsumer.java:103)
    at org.apache.lucene.index.TermsHash.finishDocument(TermsHash.java:93)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:316)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1142)

This is in v5.2.1.

I don't know if this is worth a bug record yet, but I thought I would
ask where this is *supposed* to be closed, so that I can try to make a
more direct test and maybe catch it in the act.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Determine whether a MatchAllQuery or a Query with atleast one Term

2015-11-29 Thread Trejkaz
On Mon, Nov 30, 2015 at 4:15 PM, Sandeep Khanzode
 wrote:
> I want to check whether the net effect of this query (bool or otherwise) is a 
> MatchAllQuery (i.e. without
> any terms) or a query with at least one term, or numeric range.

Or both.

*:* OR field:text

The net effect is a MatchAllQuery, *AND* there is at least one term.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Trejkaz
On Thu, Oct 22, 2015 at 7:05 PM, Uwe Schindler  wrote:
> Hi,
>
>> Setting aside the fact that Character.toLowerCase is already dubious in some 
>> locales (e.g. Turkish),
>
> This is not true. Character.toLowerCase() works locale-independent.
> It is only String.toLowerCase that works using default locale.

Yet if you have a field like "title" and the user and system are
Turkish, the user would expect their locale to apply, yet
LowerCaseFilter will not handle that. So whereas it is "safe" for
English hard-coded strings, it isn't safe for all fields you might
index in general.
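
Concretely, the case I have in mind:

    // locale-independent lowercasing of dotted/dotless I is exactly where Turkish users get surprised
    System.out.println(Character.toLowerCase('I'));                   // 'i' (U+0069), regardless of locale
    System.out.println("I".toLowerCase(new java.util.Locale("tr")));  // "ı" (U+0131), what a Turkish user expects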

Dawid's response shows, though, that at least for the time being,
there is nothing to worry about. Hopefully Unicode will never add a
code point which lowercases to one with fewer code units (or, I guess,
change an existing code point to lowercase to more than one...)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Dubious stuff spotted in LowerCaseFilter

2015-10-21 Thread Trejkaz
Hi all.

LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work.
The latter method looks like this:

public final void toLowerCase(final char[] buffer, final int offset, final int limit) {
  assert buffer.length >= limit;
  assert offset <= 0 && offset <= buffer.length;
  for (int i = offset; i < limit;) {
    i += Character.toChars(
        Character.toLowerCase(
            codePointAt(buffer, i, limit)), buffer, i);
  }
}

Setting aside the fact that Character.toLowerCase is already dubious
in some locales (e.g. Turkish), I notice that this is using the same
"i" index counter to refer to both the source offset and the
destination offset. So basically, this code has an undocumented
assumption that Character.toLowerCase always returns a code point
which takes up the same number of characters as the original one.

Whereas I do suppose that this might be the case, did someone actually
verify it? Say, by iterating all code points or something? How
confident are we that this will continue to be the case forever? :)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Recommendation for doing a search plus collecting extra information?

2015-10-11 Thread Trejkaz
On Mon, Oct 12, 2015 at 6:32 AM, Alan Woodward <a...@flax.co.uk> wrote:
> Hi Trejkaz,
>
> You can still use a standard collector if you don’t need to worry about 
> multi-threaded search.  It sounds
> as though what you want to do is implement your own Collector that will read 
> and record docvalues hits,
> and use MultiCollector to wrap it and a standard TopDocsCollector together.

I guess the benefit of doing it directly at the Collector is that the
results will come in doc ID order, so any I/O I'm doing would be local
to the previous I/O? Which makes sense, and fetching the values seems
easy enough, but then the order I get the results is not the order
they will come back in the search, so I have to find a fairly
efficient way to map int->int so that I can look them up later.
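
Sketching what I think is being suggested, mostly for my own sanity - a
5.x-era sketch where the doc values field name and the int->int map are
placeholders, imports elided:

    class DocValuesCollector extends SimpleCollector {
        final Map<Integer, Long> idByGlobalDoc = new HashMap<>();
        private NumericDocValues values;
        private int docBase;

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            docBase = context.docBase;
            values = DocValues.getNumeric(context.reader(), "stable-id");
        }

        @Override
        public void collect(int doc) throws IOException {
            idByGlobalDoc.put(docBase + doc, values.get(doc));   // doc IDs arrive per segment, in order
        }

        @Override
        public boolean needsScores() {
            return false;   // this side collector never needs scores; the wrapped TopDocs collector decides
        }
    }

    TopScoreDocCollector topDocs = TopScoreDocCollector.create(10);
    DocValuesCollector extra = new DocValuesCollector();
    searcher.search(query, MultiCollector.wrap(topDocs, extra));
    // topDocs.topDocs() gives the ranked hits; extra.idByGlobalDoc has the value for each collected doc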

What would seem ideal here is extending ScoreDoc to put my new int in
that, so that it's stored along with the same object that gets sorted
and ultimately ends up in the array (plus the extra storage
requirement would be as low as possible), but there the ScoreDoc is
created by HitQueue#getSentinelObject() and there is no way to get a
different subclass of HitQueue in TopScoreDocCollector. So I think
this route would require reimplementing pretty much all of
TopScoreDocCollector. I guess it isn't very large, but I worry about
future API changes when messing with internal stuff.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Recommendation for doing a search plus collecting extra information?

2015-10-07 Thread Trejkaz
On Thu, Oct 8, 2015 at 1:16 PM, Erick Erickson  wrote:
> This may be an "XY" problem, you're asking how to do X thinking
> it will solve Y without telling us what Y is.
>
> What do you want to _do_ with the DV values you look up for each hit?

Keep them around as the ID to use to look up information later. i.e.,
what we used to do with the doc ID before Lucene decided the doc ID
wouldn't be stable.

e.g., the search happens at some point, and then later you want to
render a row of a table, so you want to fetch the document. But you
can't use the doc ID to do that, so we use another ID which we map
back to the doc ID once we have a reader for that operation.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Recommendation for doing a search plus collecting extra information?

2015-10-07 Thread Trejkaz
On Thu, Oct 8, 2015 at 1:48 PM, Erick Erickson  wrote:
> First off, the internal Lucene doc ID has never been stable as long as any
> segment merging of whatever style was going on, don't quite know
> where you're getting that idea.
>
> It sounds like what you're really looking for is to export complete result
> sets to "do something with them later". That's what the export capability
> was built for (Solr 4.10 and later). See:
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> Just make your Solr ID ( or whatever) a DV field and
> export..

We don't use Solr and aren't particularly planning to start doing so.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


