Re: SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

2024-03-28 Thread Robert Muir
Using spans and wildcards together is asking for trouble: you will hit limits, and it is inefficient by definition. I'd recommend changing your indexing so that your queries are fast and you aren't using wildcards that enumerate many terms at search time. Don't index words such as

Re: Re-ranking using cross-encoder after vector search (bi-encoder)

2023-02-10 Thread Robert Muir
I think it would be good to provide something like a VectorRerankField (sorry for the bad name, maybe FastVectorField would be amusing too), that just stores vectors as docvalues (no HNSW) and has a newRescorer() method that implements org.apache.lucene.search.Rescorer. Then it's easy to do as that

Re: Prioritising certain documents in the search results

2023-02-01 Thread Robert Muir
check out https://lucene.apache.org/core/9_5_0/core/org/apache/lucene/document/FeatureField.html I think this is how you want to do it: it has some suggestions on how to start without training the actual values in the docs, see "if you don't know where to start" On Wed, Feb 1, 2023 at 12:03 PM

Re: Handling Indian regional languages

2023-01-16 Thread Robert Muir
On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian wrote: > > For handling Indian regional languages, what is the advisable approach? > > 1. Indexing each language data(Tamil, Hindi etc) in specific fields like > content_tamil, content_hindi with specific per field Analyzer like Tamil > for

Re: Recurring index corruption

2023-01-02 Thread Robert Muir
Your files are getting truncated. Nothing lucene can do. If this is really the only way you can store data in this azure cloud, and this is how they treat it, then run away... don't just walk... to a different cloud. On Mon, Jan 2, 2023 at 5:19 AM S S wrote: > > We are experimenting with

Re: Is there a way to customize segment names?

2022-12-17 Thread Robert Muir
>> segments etc) and can pick up from there. You would need a mechanism >> to replay the writes the primary never had a chance to commit. >> >> On Fri, Dec 16, 2022 at 5:41 AM Robert Muir wrote: >> > >> > You are still talking "Multiple writers".

Re: Is there a way to customize segment names?

2022-12-16 Thread Robert Muir
ode (main indexer) is down, how would we recover with > a back up indexer? > > Thanks > Patrick > > > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir wrote: > > > This multiple-writer isn't going to work and customizing names won't > > allow it anyway. Each file

Re: Is there a way to customize segment names?

2022-12-15 Thread Robert Muir
This multiple-writer isn't going to work and customizing names won't allow it anyway. Each file also contains a unique identifier tied to its commit so that we know everything is intact. I would look at the segment replication in lucene/replicator and not try to play games with files and mixing

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
https://github.com/apache/lucene/pull/11955 On Sat, Nov 19, 2022 at 10:43 PM Robert Muir wrote: > > Hi, > > Is this 'synchronized' really needed? > > 1. Lucene tokenstreams are only used by a single thread. If you index > with 10 threads, 10 tokenstreams are used. > 2.

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
Hi, Is this 'synchronized' really needed? 1. Lucene tokenstreams are only used by a single thread. If you index with 10 threads, 10 tokenstreams are used. 2. These OpenNLP Factories make a new *Op for each tokenstream that they create. so there's no thread hazard. 3. If i remove 'synchronized'

Re: pagination with searchAfter

2022-09-24 Thread Robert Muir
You don't need a server-side cache as the searchAfter value has all the information, it is just your "current position". For example if you are sorting by ID and you return IDs 1,2,3,4,5, the searchAfter value is basically 5. So when you query the next time with that searchAfter=5, it skips over
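The "current position" idea can be sketched in plain Java. This is a toy model, not Lucene code: the class `SearchAfterSketch` and its `page` method are hypothetical names, and an in-memory sorted list stands in for the index. It only shows why the cursor alone carries enough state to fetch the next page.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of cursor-style pagination; an assumption-laden sketch, not Lucene's implementation. */
final class SearchAfterSketch {

    /**
     * Returns up to pageSize IDs that sort strictly after the cursor.
     * sortedIds must be in ascending order; a null cursor means "first page".
     */
    static List<Integer> page(List<Integer> sortedIds, Integer after, int pageSize) {
        List<Integer> hits = new ArrayList<>();
        for (int id : sortedIds) {
            if (after != null && id <= after) {
                continue; // already returned on an earlier page
            }
            hits.add(id);
            if (hits.size() == pageSize) {
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        List<Integer> first = page(ids, null, 5);
        // The cursor is simply the last sort value of the previous page.
        Integer cursor = first.get(first.size() - 1);
        List<Integer> second = page(ids, cursor, 5);
        System.out.println(first + " then " + second);
    }
}
```

Nothing is kept on the server between the two calls; the client hands back the last sort value it saw.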

Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Robert Muir
I opened an issue with one idea of how we can fix this, for discussion: https://github.com/apache/lucene/issues/11772 On Wed, Sep 14, 2022 at 11:27 AM Uwe Schindler wrote: > > Hi, > > do you have Microsoft Visual Studio installed? It looks like Gradle > tries to detect it and fails with some

Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Robert Muir
Looks to me like a gradle bug, detecting and trying to run some visual studio command (vswhere.exe) elsewhere on your system, and it does the wrong thing parsing its output. On Tue, Sep 13, 2022 at 3:00 PM Rahul Goswami wrote: > > Hi Dawid, > I believe you. Just that for some reason I have never

Re: Index corruption and repair

2022-04-29 Thread Robert Muir
The most helpful thing would be the full stacktrace of the exception. This exception should be chaining the original exception and call site, and maybe tell us more about this error you hit. To me, it looks like a windows-specific issue where the filesystem is returning an unexpected error. So it

Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
If you are looking at the files in hex, you can see the file format docs online for your version: https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/index/SegmentInfos.html SegID is written right after SegName, it is 16 bytes (128-bit number) On Wed, Apr 13, 2022 at 10:59 PM Robert Muir

Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
> > From: Tim Whittington > > Sent: Wednesday, April 13, 2022 9:17:44 PM > > To: java-user@lucene.apache.org > > Subject: Re: How to handle corrupt Lucene index > > > > Thanks for this - I'll have a look at the database server code that is >

Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington wrote: > > I'm working with/on a database system that uses Lucene for full text > indexes (currently using 7.3.0). > We're encountering occasional problems that occur after unclean shutdowns > of the database , resulting in >

Re: Java 17 and Lucene

2021-10-27 Thread Robert Muir
> > > > > > > > Once I figured out what makes it hang, I will open issues in > > > > > > > OpenJDK > > > > (I > > > > > > am OpenJDK member/editor). I have now many stuck JVMs running to > > > > analyze >

Re: Java 17 and Lucene

2021-10-18 Thread Robert Muir
We test different releases on different platforms (e.g. Linux, Windows, Mac). We also test EA (Early Access) releases of openjdk versions during the development process. This finds bugs before they get released. More information about versions/EA testing: https://jenkins.thetaphi.de/ On Mon, Oct

Re: Search while typing (incremental search)

2021-10-06 Thread Robert Muir
TLDR: use the lucene suggest/ package. Start with building suggester from your query logs (either a file or index them). These have a lot of flexibility about how the matches happen, for example pure prefixes, edit distance typos, infix matching, analysis chain, even now Japanese input-method

Re: NRT readers and overall indexing/querying throughput

2021-08-08 Thread Robert Muir
On Tue, Aug 3, 2021 at 10:43 PM Alexander Lukyanchikov wrote: > > Maybe I have wrong expectations, and less frequent commits with NRT refresh > were not intended to improve overall performance? > > Some details about the tests - > Base implementation commits and refreshes a regular reader every

Re: Tuning MoreLikeThis scoring algorithm

2021-05-28 Thread Robert Muir
See https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages which has some broken nabble links, but is still valid. TLDR: Scoring just doesn't work the way you think. Don't try to interpret it as an absolute value, it is a relative one. On Fri, May 28, 2021 at 1:36 PM TK Solr

Re: Lucene 8 causing app server threads to hang due to high rate of network usage

2021-04-28 Thread Robert Muir
Don't use filesystems such as NFS (that is what EFS is) with lucene! This is really bad design, and it is the root cause of your issue. On Tue, Apr 27, 2021 at 1:21 PM Hilston, Kathleen < kathleen.hils...@snapon.com> wrote: > Hello, > > > > My name is Kathleen Hilston, and I am a Software

Re: CorruptIndexException after failed segment merge caused by No space left on device

2021-03-24 Thread Robert Muir
On Wed, Mar 24, 2021 at 1:41 AM Alexander Lukyanchikov < alexanderlukyanchi...@gmail.com> wrote: > Hello everyone, > > Recently we had a failed segment merge caused by "No space left on device". > After restart, Lucene failed with the CorruptIndexException. > The expectation was that Lucene

Re: Incorrect CollectionStatistics if IndexWriter.close is not called

2021-03-03 Thread Robert Muir
Marc, you don't need to reindex to have fewer deletes and less impact from this. Merging will get rid of the deletes. If updates are coming in batches, you could consider calling IndexWriter#forceMergeDeletes after updating a batch to keep things tidy. Otherwise, if updates are coming in

Re: BigIntegerPoint

2021-02-26 Thread Robert Muir
It was added to the sandbox originally (along with InetAddressPoint for ip addresses) and just never graduated from there: https://issues.apache.org/jira/browse/LUCENE-7043 The InetAddressPoint was moved to core because it seems pretty common that people want to do range queries on IP hosts and

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
reload? > > Thanks for the explanations. This thread will be useful for many folks i > believe. > > Best regards > > > On 2/23/21 4:15 PM, Robert Muir wrote: > > > > On Tue, Feb 23, 2021 at 4:07 PM wrote: > >> What i want to achieve: Problem statement: >>

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 4:07 PM wrote: > What i want to achieve: Problem statement: > > base case is disk based Lucene index with FSDirectory > > speedup case was supposed to be in memory Lucene index with MMapDirectory > On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
> magnitude of speedup from already very fast on disk Lucene indexes. > > So i was expecting really really really fast response with MMapDirectory. > > Thanks > > > On 2/23/21 3:40 PM, Robert Muir wrote: > > Don't give gobs of memory to your java process, you will just make thin

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
Don't give gobs of memory to your java process, you will just make things slower. The kernel will cache your index files. On Tue, Feb 23, 2021 at 1:45 PM wrote: > Ok, but how is this MMapDirectory used then? > > Best regards > > > On 2/23/21 7:03 AM, Robert Muir wrote: >

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 2:30 AM wrote: > Hi,- > > I tried MMapDirectory and i allocated as big as index size on my J2EE > Container but > > Don't allocate java heap memory for the index, MMapDirectory does not use java heap memory!

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Robert Muir
On Mon, Dec 14, 2020 at 1:59 PM Uwe Schindler wrote: > > Hi, > > as writer of the original blog post, here my comments: > > Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post is > to load everything into memory - but that does not guarantee anything! > Still, I would not

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Robert Muir
it is enough to give each its own field. On Tue, Dec 3, 2019 at 7:57 AM Adrien Grand wrote: > Is there any reason why you are not storing each DOC_TYPE in its own index? > > On Tue, Dec 3, 2019 at 1:50 PM Ravikumar Govindarajan > wrote: > > > > Hello, > > > > We are using TF-IDF for scoring

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Robert Muir
ld.newSlowRangeQuery. > > > > > > Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir > έγραψε: > > > Can you explain a little more about your use-case? I think that's the > > biggest problem here for term range query. Pretty much all range > > use-cases

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Robert Muir
do this. So here is where you could step in and improve the terms >> dictionary! >> >> Uwe >> >> > Uwe >> > >> > - >> > Uwe Schindler >> > Achterdiek 19, D-28357 Bremen >> > http://www.thetaphi.de >> > eMail

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Robert Muir
urious to understand what IntersectTermsEnum is supposed to > do. > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir > έγραψε: > > > Is this IntersectTermsEnum really being used for term range query? Seems > > like using a standard TermsEnum, seeking to the start of the range,

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Robert Muir
Is this IntersectTermsEnum really being used for term range query? Seems like using a standard TermsEnum, seeking to the start of the range, then calling next until the end would be easier. On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis wrote: > Hi all, > > I am currently working on
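The seek-then-iterate approach can be illustrated without Lucene. In this sketch a `TreeSet` stands in for the sorted terms dictionary; `TermRangeSketch` and `termsInRange` are hypothetical names, and Java's `String` ordering is assumed to be close enough for the example (Lucene orders terms by their bytes, which agrees for ASCII).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of "seek to the start of the range, then call next until the end". */
final class TermRangeSketch {

    /** Collects every term t with lower <= t <= upper from a sorted set of terms. */
    static List<String> termsInRange(NavigableSet<String> terms, String lower, String upper) {
        List<String> out = new ArrayList<>();
        // "Seek": position on the first term >= lower.
        String term = terms.ceiling(lower);
        // "Next until the end": advance while the term is still <= upper.
        while (term != null && term.compareTo(upper) <= 0) {
            out.add(term);
            term = terms.higher(term);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
                new TreeSet<>(List.of("apple", "banana", "cherry", "date", "fig"));
        System.out.println(termsInRange(terms, "b", "d"));
    }
}
```

The loop touches only the terms inside the range plus one lookup to find the start, which is why a plain enumeration is the simpler option here.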

Re: Getting Exception : java.nio.channels.ClosedByInterruptException

2019-04-01 Thread Robert Muir
Some code interrupted (Thread.interrupt) a java thread while it was blocked on I/O. This is not safe to do with lucene, because unfortunately in this situation java's NIO code closes file descriptors and releases locks. The second exception is because the indexwriter tried to write when it no

Re: prorated early termination

2019-02-05 Thread Robert Muir
in TopFieldCollector that > doesn't require any public-facing API change or refactoring at all. It just > terminates a little earlier based on the segment distribution. Here's a PR > so you can see what this is: https://github.com/apache/lucene-solr/pull/564 > > > On Mon

Re: prorated early termination

2019-02-04 Thread Robert Muir
Regarding adding a threshold to TopFieldCollector, do you have ideas on what it would take to fix the relevant collector/indexsearcher APIs to make this kind of thing easier? (i know this is a doozie, but we should at least try to think about it, maybe make some progress) I can see where things

Re: IndexOptions & LongPoints

2018-09-18 Thread Robert Muir
On Tue, Sep 18, 2018 at 7:00 AM, Seth Utecht wrote: > > My concern is that it seems like LongPoint's FieldType has an IndexOptions > that is always NONE. It strikes me as odd, because we are in fact indexing > and searching against these LongPoint fields. > Points fields don't create an inverted

Re: Search in lines, so need to index lines?

2018-08-01 Thread Robert Muir
http://man7.org/linux/man-pages/man1/grep.1.html On Wed, Aug 1, 2018 at 7:01 AM, Gordin, Ira wrote: > Hi Tomoko, > > I need to search in many files and we use Lucene for this purpose. > > Thanks, > Ira > > -Original Message- > From: Tomoko Uchida > Sent: Wednesday, August 1, 2018 1:49

Re: any example on FunctionScoreQuery since Field.setBoost is deprecated with Lucene 6.6.0

2018-07-31 Thread Robert Muir
Does this example help? https://lucene.apache.org/core/7_4_0/expressions/org/apache/lucene/expressions/Expression.html On Tue, Jul 31, 2018 at 3:56 PM, wrote: > The following page says: > > http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/document/Field.html#setBoost-float- > >

Re: offsets

2018-07-31 Thread Robert Muir
The problem is not a performance one, it's a complexity thing. Really I think only the tokenizer should be messing with the offsets... They are the ones actually parsing the original content so it makes sense they would produce the pointers back to them. I know there are some tokenfilters out there

Re: offsets

2018-07-25 Thread Robert Muir
I think you see it correctly. Currently, only tokenizers can really safely modify offsets, because only they have access to the correction logic from the charfilter. Doing it from a tokenfilter just means you will have bugs... On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov wrote: > I've run

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
ood -- this ConditionalTokenFilter is going to be very > helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke > around and see about incorporating the emoji rules from there. Thanks > Robert > > On Tue, Jul 3, 2018 at 9:28 AM Robert Muir wrote: > >> > Any thoughts

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
> Any thoughts? best idea I have would be to tokenize with ICUTokenizer, which will tag emoji sequences as "<EMOJI>" token type, then use ConditionalTokenFilter to send all tokens EXCEPT those with token type of "<EMOJI>" to your WordDelimiterFilter. This way WordDelimiterFilter never sees the emoji at all

Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov wrote: > WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters > like punctuation and thus remove them, but we would like to be able to > search for emoji and use this filter for handling dashes, dots and other > intra-word

Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
already have an > earlier component where we can handle this (we have a custom ICUTokenizer > rbbi and can just split on "^"). So many flexibility > > -Mike > > On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir wrote: > >> actually, you now can choose to ignore certain

Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
actually, you now can choose to ignore certain characters by using unicode filtering mechanism. This was added in https://issues.apache.org/jira/browse/LUCENE-8129 So apply a filter such as [^\^] and the filter will ignore ^. On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir wrote: > This can

Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
This cannot be "tweaked" at runtime, it is implemented as custom normalization. You can modify the sources / build your own ruleset or use a different tokenfilter to normalize characters. On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov wrote: > Hi, I'm using ICUFoldingFilter and for the most

Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException

2018-02-02 Thread Robert Muir
osition is fine. > > So that at least I can know this segment is newly corrupted one or it is > previous corrupted and merge to a new one. > > > On 2/2/18, 9:58 PM, "Robert Muir" <rcm...@gmail.com> wrote: > > IMO this is not something you want to do. >

Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException

2018-02-02 Thread Robert Muir
IMO this is not something you want to do. The only remedy CheckIndex has for a corrupted segment is to drop it completely: and if you choose to do that then you lose all the documents in that segment. So it's not very useful to merge it with other segments into bigger corrupted segments since it

Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Robert Muir
Erick I don't think solr was mentioned here. On Thu, Jan 18, 2018 at 8:03 AM, Erick Erickson wrote: > My first question is always "are you running the Solr CPUs flat out?". > My guess in this case is that the indexing client is the same and the > problem is in Solr, but

Re: Help regarding BM25Similarity

2018-01-04 Thread Robert Muir
You don't need to do any subclassing for this: just pass parameter b=0 to the constructor. On Thu, Jan 4, 2018 at 10:58 AM, Parit Bansal wrote: > Hi, > > I am trying to tweak BM25Similarity for my use case wherein, I want to avoid > the effects of field-length
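Why b=0 removes the field-length effect can be seen from the textbook BM25 term-frequency formula. The sketch below is plain arithmetic, not Lucene's internal code; `Bm25LengthNormSketch` and `tfPart` are hypothetical names, and the formula is the classic one (recent Lucene versions may scale it differently).

```java
/** Classic BM25 term-frequency component; an illustration, not Lucene's implementation. */
final class Bm25LengthNormSketch {

    /**
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     * With b = 0 the length term collapses to 1, so docLen no longer matters.
     */
    static double tfPart(double tf, double docLen, double avgDocLen, double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        // Same term frequency in a short (10 tokens) and a long (1000 tokens) field.
        System.out.println("b=0.75 short: " + tfPart(2, 10, 100, k1, 0.75));
        System.out.println("b=0.75 long:  " + tfPart(2, 1000, 100, k1, 0.75));
        System.out.println("b=0    short: " + tfPart(2, 10, 100, k1, 0.0));
        System.out.println("b=0    long:  " + tfPart(2, 1000, 100, k1, 0.0));
    }
}
```

In Lucene itself this corresponds to constructing the similarity with the two-argument `BM25Similarity(k1, b)` constructor and passing 0 for b, as the reply says.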

Re: How to use Hunspell dictionary to do the reverse of stemming ?

2017-10-24 Thread Robert Muir
On Tue, Oct 24, 2017 at 11:04 AM, julien Blaize wrote: > Hello, > > I am looking for a way to efficiently do the reverse of stemming. > Example: if I give the program the verb "drug" it will give me > "drugged", "drugging", "drugs", "drugstore" etc... To generate the

Re: Accent insensitive search for greek characters

2017-10-24 Thread Robert Muir
Your greek transform stuff does not work because you use "Lower" instead of casefolding. If ICUFoldingFilter works for what you want, but you want to restrict it to greek, then just restrict it to the greek region. See FilteredNormalizer2 and UnicodeSet documentation. And look at how

Re: ClassicAnalyzer Behavior on accent character

2017-10-19 Thread Robert Muir
easy, don't use classictokenizer: use standardtokenizer instead. On Thu, Oct 19, 2017 at 9:37 AM, Chitra wrote: > Hi, > I indexed a term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane) and the term was > indexed as "er l n", some characters were trimmed while indexing. > > Here is

Re: How to regulate native memory?

2017-08-30 Thread Robert Muir
free shared buff/cache > available > Mem:125 54 1 1 69 > 69 > Swap: 0 0 0 > > Thanks for the reply! Apologies if not apropos to this forum - just working > my way

Re: How to regulate native memory?

2017-08-30 Thread Robert Muir
Hello, From the thread linked there, it's not clear to me that the problem relates to lucene (vs being e.g. a bug in netty, or too many threads, or potentially many other problems). Can you first try to determine the breakdown of your problematic "RSS" from the operating system? Maybe this helps

Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Robert Muir
On Tue, Apr 18, 2017 at 5:16 PM, Michael McCandless wrote: > > +1 to use the tests to learn how things work; I don't know of any guide / > high level documentation for these low level classes, sorry. Maybe write > it up yourself and set it free somewhere online ;)

Re: Altering Term Frequency in Similarity

2016-12-15 Thread Robert Muir
Maybe have a look at SynonymQuery: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/SynonymQuery.java I think it does a similar thing to what you want, it sums up the frequencies of the synonyms and passes that sum to the similarity class as TF. On
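The summing step described above can be sketched in a few lines of plain Java. This is only a model of the idea: `SynonymTfSketch` and `combinedTf` are hypothetical names, and a map of per-document term counts stands in for the postings.

```java
import java.util.List;
import java.util.Map;

/** Toy model: treat a group of synonyms as one term by summing their frequencies in a document. */
final class SynonymTfSketch {

    /** Returns the combined term frequency that would be handed to the similarity as TF. */
    static int combinedTf(Map<String, Integer> termFreqsInDoc, List<String> synonyms) {
        int sum = 0;
        for (String synonym : synonyms) {
            // A synonym that does not occur in the document contributes 0.
            sum += termFreqsInDoc.getOrDefault(synonym, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = Map.of("car", 2, "automobile", 1, "bike", 4);
        System.out.println(combinedTf(doc, List.of("car", "automobile")));
    }
}
```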

Re: Segment Corruption - ForUtil.readBlock AIOBE

2016-08-08 Thread Robert Muir
Can you run checkindex and include the output? On Mon, Aug 8, 2016 at 2:36 AM, Ravikumar Govindarajan wrote: > For some of the segments we received the following exception during merge > as well as search. They look to be corrupt [Lucene 4.6.1 & Sun JDK >

Re: Problems with Lat / Long searches at minimum and maximum latitude and longitude

2016-06-12 Thread Robert Muir
See this part of the documentation: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/geo/Polygon.java#L30 APIs take newPolygonQuery(Polygon...) which is treated efficiently as a "multipolygon". This is also what many standards (e.g. geojson) recommend,

Re: Lucene DirectSpellChecker strange behavior

2016-06-07 Thread Robert Muir
It's just a heuristic: that it does not allow 2 edits (insertion/deletion/substitution/transposition) to the word if the first character differs ( https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/spell/DirectSpellChecker.java#L411). So when it goes

Re: Cannot comment on Jira issues

2016-04-23 Thread Robert Muir
OK should really work now! On Sat, Apr 23, 2016 at 10:37 AM, Andres de la Peña <adelap...@stratio.com> wrote: > I'm still not able to comment, although I have tried to logout and login > again. > > 2016-04-23 15:31 GMT+01:00 Robert Muir <rcm...@gmail.com>: > >

Re: Cannot comment on Jira issues

2016-04-23 Thread Robert Muir
Can you try now? I added you to contributors groups. On Sat, Apr 23, 2016 at 10:26 AM, Andres de la Peña wrote: > Hi, > > I would like to reply to the answer to my comment on LUCENE-7086 > . Could I be temporary > added to

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Robert Muir
The scoring algorithm can't be expected to deal with totally bogus (e.g. mathematically impossible) statistics, such as docFreq > docCount. Many of them may fall apart. We should try to improve that about BlendedTermQuery! SynonymQuery should not really exist. It exists because of problems like
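A small numeric sketch shows how an impossible statistic turns the IDF negative. It assumes the BM25-style formula ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); `IdfSketch` and `idf` are hypothetical names, and this is arithmetic only, not Lucene's code path.

```java
/** BM25-style IDF, used here only to show the effect of docFreq > docCount. */
final class IdfSketch {

    /** ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Valid statistics: the term occurs in 5 of 10 documents.
        System.out.println("docFreq=5,  docCount=10: " + idf(5, 10));
        // Bogus statistics: the term "occurs" in more documents than exist.
        System.out.println("docFreq=20, docCount=10: " + idf(20, 10));
    }
}
```

With valid input the fraction is positive and the logarithm stays above zero; once docFreq exceeds docCount the argument of the logarithm drops below 1 and the result goes negative.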

Re: Lucene indexing throughput (and Mike's lucenebench charts)

2016-04-15 Thread Robert Muir
you won't see indexing improvements there because the dataset in question is wikipedia and mostly indexing full text. I think it may have one measly numeric field. On Thu, Apr 14, 2016 at 6:25 PM, Otis Gospodnetić wrote: > (replying to my original email because I

Re: Depreciated IntField field in v6

2016-04-15 Thread Robert Muir
On Fri, Apr 15, 2016 at 11:48 AM, Greg Huber wrote: > Hello, > > I was using the IntField field to set the weight on my suggester. > (LegacyIntField works) > > old: > > document.add(new IntField( > FieldConstants.LUCENE_WEIGHT_LINES, >

Re: SloppyMath license

2015-09-19 Thread Robert Muir
There is nothing unusual about public domain code. If your lawyers do not understand that, tell them to go back to school. On Sat, Sep 19, 2015 at 11:31 AM, Sergii Kabashniuk wrote: > Hello > Right now I'm working on approval to use lucene-core in Eclipse projects. >

Re: SloppyMath license

2015-09-19 Thread Robert Muir
call it public domain, call it attribution-only, whatever you like. there is nothing incompatible with Apache 2, fdlibm was also used by apache harmony for its math code. On Sat, Sep 19, 2015 at 12:33 PM, Earl Hood <earlh...@gmail.com> wrote: > On Sat, Sep 19, 2015 at 11:14 AM, Robert M

Re: Problems with toString at TermsQuery

2015-09-09 Thread Robert Muir
I think it's a bug: https://issues.apache.org/jira/browse/LUCENE-6792 On Tue, Sep 8, 2015 at 10:35 AM, Ruslan Muzhikov wrote: > Hi! > Sometimes TermsQuery.toString() method falls with exception: > > *Exception in thread "main" java.lang.AssertionError* > * at

Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Robert Muir
That makes no sense at all, it would make it slow as shit. I am tired of repeating this: Don't use BINARY docvalues Don't use BINARY docvalues Don't use BINARY docvalues Use types like SORTED/SORTED_SET which will compress the term dictionary and make use of ordinals in your application instead.

Re: scanning whole index stored fields while using best compression mode

2015-06-03 Thread Robert Muir
On Wed, Jun 3, 2015 at 4:59 PM, Anton Zenkov azen...@crimsonhexagon.com wrote: Reindexing. If I want to add new fields or change existing fields in the index I need to go through all documents of the index. if your reindexing process needs all the docs, i dont think i can really recommend a

Re: scanning whole index stored fields while using best compression mode

2015-06-03 Thread Robert Muir
On Wed, Jun 3, 2015 at 4:00 PM, Anton Zenkov azen...@crimsonhexagon.com wrote: for (int i = 0; i < leafReader.maxDoc(); i++) { DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor(); fieldsReader.visitDocument(i, visitor); visitor.getDocument(); } } I was

Re: IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-29 Thread Robert Muir
Hi Ahmet, It's due to the use of sentinel values by your collector in its priority queue by default. TopScoreDocCollector warns about this, and if you turn on assertions (-ea) you will hit them in your tests: * <p><b>NOTE</b>: The values {@link Float#NaN} and * {@link Float#NEGATIVE_INFINITY} are

Re: SortingAtomicReader alternate to Tim-Sort...

2015-04-22 Thread Robert Muir
On Tue, Apr 21, 2015 at 4:00 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk. Our segments are already sorted before participating in a merge. On mostly linear merge, we ended up decoding the same

Re: Customizing Regexp syntax in Lucene

2015-04-05 Thread Robert Muir
On Sun, Apr 5, 2015 at 5:08 PM, code fx9 code...@gmail.com wrote: Hi, We are using Lucene indirectly via ElasticSearch. We would like to use RE2 syntax for running regex queries against Lucene. We are already using RE2 syntax for other parts of our system, so not ability to use the same syntax

Re: write.lock is not removed

2015-02-23 Thread Robert Muir
It should not be deleted. Just don't mess with it. On Mon, Feb 23, 2015 at 7:57 AM, Just Spam schlibos...@gmail.com wrote: Hello, i am trying to index a file (Lucene 4.10.3) – in my opinion in the correct way – will say: get the IndexWriter, Index the Doc and add them, prepare commit,

Re: write.lock is not removed

2015-02-23 Thread Robert Muir
That's why locking didn't work correctly back then. On Mon, Feb 23, 2015 at 8:18 AM, Just Spam schlibos...@gmail.com wrote: Any reason? I remember in 3.6 the lock was removed/deleted? 2015-02-23 14:13 GMT+01:00 Robert Muir rcm...@gmail.com: It should not be deleted. Just don't mess

Re: Document Ordering

2015-02-16 Thread Robert Muir
Have a look at SortingMergePolicy: http://lucene.apache.org/core/4_10_0/misc/org/apache/lucene/index/sorter/SortingMergePolicy.html On Mon, Feb 16, 2015 at 9:47 PM, Elliott Bradshaw ebradsh...@gmail.com wrote: Hi, I'm interested in using Lucene to index binary objects with a specific

Re: A codec moment or pickle

2015-02-13 Thread Robert Muir
heh, i just don't think thats the typical case. Its definitely extreme. Even still, in many cases using the filesystem (properly warmed) with compression might still be better. It depends how you are measuring latency. storing your whole index in gigabytes of heap ram without any compression on a

Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
Honestly i dont agree. I don't know what you are trying to do, but if you want file format backwards compat working, then you need a different FilterCodec to match each lucene codec. Otherwise your codec is broken from a back compat standpoint. Wrapping the latest is an antipattern here. On

Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 8:51 AM, Benson Margulies ben...@basistech.com wrote: On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir rcm...@gmail.com wrote: Honestly i dont agree. I don't know what you are trying to do, but if you want file format backwards compat working, then you need a different

Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 11:58 AM, McKinley, James T james.mckin...@cengage.com wrote: Hi Robert, Thanks for responding to my message. Are you saying that you or others have encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with G1 and was it on Windows or on Linux? If

Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8)

2015-02-11 Thread Robert Muir
, February 07, 2015 2:22 PM To: java-user Subject: Re: Lucene Version Upgrade (3-4) and Java JVM Versions(6-8) The G1C1 issue reference by Robert Muir on the Wiki page is at a Lucene level. Lucene, of course, is critically important to Solr so from that perspective it is about Solr too. https

Re: SegmentCommitInfos and live/deleted files

2015-01-11 Thread Robert Muir
files are either per-segment or per-commit. the first only returns per-segment files. this means it won't include any per-commit files: * segments_N itself * generational .liv for deletes * generational .fnm/.dvd/etc for docvalues updates. the second includes per-commit files, too. it doesnt

Re: manually merging Directories

2014-12-30 Thread Robert Muir
FYI there is more discussion on https://issues.apache.org/jira/browse/LUCENE-4746 In general, i don't like the idea that if things go wrong (which they will), that the input Directories would be left in a trashed state. To me, hard links would be the correct solution, but Files.createLink is an

Re: manually merging Directories

2014-12-30 Thread Robert Muir
good to keep this in mind for future reference. Thanks! Shaun From: Robert Muir rcm...@gmail.com Sent: December 30, 2014 9:36 AM To: java-user Subject: Re: manually merging Directories FYI there is more discussion on https://issues.apache.org

Re: Building non-core jar-files from lucene sources.

2014-12-02 Thread Robert Muir
If you run ant -p it will print targets and descriptions. you want 'ant compile'. In my opinion the default target should not be 'jar', but print this list of targets instead, just like the top-level build file. On Tue, Dec 2, 2014 at 12:09 PM, Badano Andrea andrea.bad...@sweco.se wrote:

Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
You should not have to configure anything. The exception should not happen: can I have this index to debug the issue? On Mon, Sep 22, 2014 at 11:07 PM, Patrick Mi patrick...@touchpoint.co.nz wrote: Hi there, I understood that Lucene V4 could read 3.x index files by configuring Lucene3xCodec

Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
I opened an issue with a patch for this: https://issues.apache.org/jira/browse/LUCENE-5975 Thanks for reporting it! On Mon, Sep 22, 2014 at 11:07 PM, Patrick Mi patrick...@touchpoint.co.nz wrote: Hi there, I understood that Lucene V4 could read 3.x index files by configuring Lucene3xCodec

Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
indeed worked for the V3 but not V4.10. Maybe something in that index could cause problem in V4 but not v3. Also I have tried an earlier version v4.7 as Uwe suggested and V4.7 version works on the V3 index that V4.10 failed to open. Regards, Patrick -Original Message- From: Robert

Re: Insufficient system resources exist to complete the requested service

2014-09-15 Thread Robert Muir
SimpleFSDirectory doesn't use memory mapping. I'd check that you don't have leaks of IndexReaders or similar. This error happens on Windows when it runs out of open file handles. On Mon, Sep 15, 2014 at 3:52 AM, Michael McCandless luc...@mikemccandless.com wrote: Maybe your OS is running out of total

Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Robert Muir
Ian, this looks terrible, thanks for reporting this. Is there any possible way I could have a copy of that working index to make it easier to reproduce? On Wed, Sep 10, 2014 at 7:01 AM, Ian Lea ian@gmail.com wrote: Hi On running a quick test after a handful of minor code changes to deal

Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Robert Muir
Ian, it's a supported version. It wouldn't matter if it's 4.0 alpha or beta anyway, because we support index back compat for those. In your case, it's actually the final version. I will open an issue. Thank you for reporting this! On Wed, Sep 10, 2014 at 7:54 AM, Ian Lea ian@gmail.com wrote:

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-09-10 Thread Robert Muir
That's because there are 3 constructors in SegmentReader: 1. one used for opening new (checks hasDeletions, only reads liveDocs if so) 2. one used for non-NRT reopen -- the problem one for you 3. one used for NRT reopen (takes a LiveDocs as a param, so no bug). So personally I think you should be able

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-09-10 Thread Robert Muir
point 1 DirectoryReader searchReader = DirectoryReader.openIfChanged(latest, ic1); // === Exception/Assertion thrown here On Wed, Sep 10, 2014 at 6:26 PM, Robert Muir rcm...@gmail.com wrote: Thats because there are 3 constructors in segmentreader: 1. one used for opening new (checks

Re: Snowball filter - Error instantiating stemmer for a language

2014-09-05 Thread Robert Muir
On Thu, Sep 4, 2014 at 7:32 PM, atawfik contact.txl...@gmail.com wrote: As you can see, the key here is calling the *inform* method of the SnowballPorterFilterFactory to add the respective language stemmer's class. This is actually the task of the *inform* method. The standard constructor of
