Re: Using Lucene for technical documentation

2020-11-23 Thread Erick Erickson
You might be able to get something “good enough” with one of the pattern tokenizers, see: https://lucene.apache.org/solr/guide/8_6/tokenizers.html. Won’t be 100% of course. And Paul’s comments are well taken, especially since your input will be inconsistent I’d guess. How much you want to bet

Re: Lucene Migration Query

2020-11-22 Thread Erick Erickson
If you created your index with 7x, you don’t need to do anything, 8x will be able to operate with it. If you ever used 6x to index any docs you must reindex completely by deleting the entire index and starting over, or index to a new collection and use collection aliasing to seamlessly switch.
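
The rule in this reply can be sketched in plain Java. This is only an illustration, assuming the major version that first created the index is known; the class and method names are hypothetical and this is not Lucene's own compatibility check.

```java
// Sketch of the upgrade rule stated above; names are hypothetical.
final class IndexCompatSketch {
    /**
     * @param createdMajor major version that first wrote the index (assumed known)
     * @param readerMajor  major version of the Lucene that tries to open it
     */
    static boolean canOpen(int createdMajor, int readerMajor) {
        // Only one major version back is allowed, counted from the version
        // that created the index, not from the version that last rewrote it.
        return createdMajor <= readerMajor && createdMajor >= readerMajor - 1;
    }

    public static void main(String[] args) {
        System.out.println("7x index opened by 8x: " + canOpen(7, 8));
        System.out.println("6x index opened by 8x: " + canOpen(6, 8));
    }
}
```

When the check fails, the options are the ones named in the reply: reindex from scratch, or index into a new collection and switch with an alias.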

Re: Lucene Migration query

2020-11-20 Thread Erick Erickson
The IndexUpgraderTool does a forceMerge(1). If you have a large index, that has its own problems, but will work. The threshold for the issues is 5G. See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/ I should emphasize that if you have a very large single segment as a result,

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Erick Erickson
13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote: > > Hi,- > Thanks. > These are final finished sizes in both cases. > Best regards > > >> On Nov 12, 2020, at 11:12 PM, Erick Erickson wrote: >> >> Yes, that issue is fixed. The “Resolution” tag is the ke

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread Erick Erickson
Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked “fixed” and the version is 8.0 As for your other question, index size is a very imprecise number. How many deleted documents are there in each case? Deleted documents take up disk space until the segments containing them

Re: Which Lucene 8.5.X is recommended?

2020-11-12 Thread Erick Erickson
Always use the most recent point release. The only time we go from x.y.z to x.y.z+1 is if there are _significant_ problems. This is much different than going from x.y to x.y+1... > On Nov 12, 2020, at 5:49 PM, baris.ka...@oracle.com wrote: > > Hi,- > > is it best to use 8.5.2? > > Best
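
As a rough illustration of "use the most recent point release", the following sketch picks the highest z for a given x.y from a list of version strings. The helper names are hypothetical and the version list is only sample input.

```java
import java.util.List;
import java.util.Optional;

// Sketch: choose the newest x.y.z within one x.y line; names are hypothetical.
final class PointReleaseSketch {
    /** Parses "x.y.z" into three ints; assumes exactly three numeric parts. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        return new int[] {
            Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Integer.parseInt(parts[2])
        };
    }

    /** Returns the version with the highest patch number for the given major.minor. */
    static Optional<String> latestPointRelease(List<String> available, int major, int minor) {
        return available.stream()
            .filter(v -> parse(v)[0] == major && parse(v)[1] == minor)
            .max((a, b) -> Integer.compare(parse(a)[2], parse(b)[2]));
    }
}
```

Moving from x.y to x.y+1 is a different decision and is deliberately left out of the sketch.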

Re: BooleanQuery: BooleanClause.Occur.MUST_NOT seems to require at least one BooleanClause.Occur.MUST

2020-11-06 Thread Erick Erickson
Nissim: Here’s a good explanation of why it was designed this way if you’d like details: https://lucidworks.com/post/why-not-and-or-and-not/ Don’t be put off by the Solr title, it’s really about BooleanQuery and BooleanClause Best, Erick > On Nov 6, 2020, at 8:17 AM, Adrien Grand wrote: > >

Re: Deduplication of search result with custom sort

2020-10-09 Thread Erick Erickson
out grouping, 0.6 sec with > grouping and 10 sec with setAllGroups(true) > > Thank you, Erick, I will look into it > > пт, 9 окт. 2020 г. в 14:32, Erick Erickson : > >> At the Solr level, CollapsingQParserPlugin see: >> https://lucene.apache.org/solr/guide/8_6/collapse-and

Re: Deduplication of search result with custom sort

2020-10-09 Thread Erick Erickson
At the Solr level, CollapsingQParserPlugin see: https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html You could perhaps steal some ideas from that if you need this at the Lucene level. Best, Erick > On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) > wrote:

Re: Exact sub-phrase matching?

2020-09-25 Thread Erick Erickson
Have you looked at edismax, pf2 and pf3? On Fri, Sep 25, 2020, 15:07 Gregg Donovan wrote: > Hello! > > I'm wondering what the state-of-the-art for matching exact sub phrases > within Lucene is. As a bonus, I'd love to attach a boost to each of the > subphrases matched (if possible). > > For

Re: I resurrected a 2013 project (Lucene 4.2) and I want to convert it to 8.6

2020-08-04 Thread Erick Erickson
Well, a _lot_ has changed since 4.x. Rather than look through the code, I’d start with the reference guide and the upgrade notes and major changes that accompany any release. As for “official dictionaries”, no there aren’t. “somewhere out on the web” there are certainly various word lists you

Re: Ulimit recommendation for Apache Lucene 6.5.1

2020-07-14 Thread Erick Erickson
ote: > > If you cache the IndexSearcher and only have a couple of segments, and it’s > a read only system (indexing is done just once), would it still open a lot > of files? > > On Tue, 14 Jul 2020 at 7:05 PM, Erick Erickson > wrote: > >> At least 65K. Yes,

Re: Ulimit recommendation for Apache Lucene 6.5.1

2020-07-14 Thread Erick Erickson
At least 65K. Yes, 65 thousand. Ditto for processes. > On Jul 14, 2020, at 8:35 AM, Archana A M wrote: > > Dear Team, > > We are getting "too many open files" in server while trying to access > apache Lucene cache. > > Could someone please suggest the recommended open file limit while using >

Re: Storing Json field in Lucene

2020-04-22 Thread Erick Erickson
"Is it good idea to store complete Json as string to Lucene DB. If we store as separate fields then we have around 30 fields. There will be 30 seeks to get complete stored fields” This is not true. Under the covers, all the stored fields are compressed and stored as a blob and Lucene does the

Re: Lucene download page

2020-02-23 Thread Erick Erickson
No, 7.7.2 was a patch fix that _was_ released after 8.1.1. > On Feb 22, 2020, at 2:49 PM, baris.ka...@oracle.com wrote: > > Hi,- > > i hope everyone is doing great. > > Lucene 7.7.2 is listed as released after Lucene 8.1.1 is released on this > page >

Re: Searching number of tokens in text field

2019-12-30 Thread Erick Erickson
This comes up occasionally, it’d be a neat thing to add to Solr if you’re motivated. It gets tricky though. - part of the config would have to be the name of the length field to put the result into, that part’s easy. - The trickier part is “when should the count be incremented?”. For instance,

Re: Lucene index directory grows and shrinks

2019-11-04 Thread Erick Erickson
Here’s a neat visualization: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html The short form is this: - A “segment” is all the files with a particular prefix in your index directory, e.g. _12ey1* is one segment - Segments are created as documents are indexed and
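
To make "a segment is all the files with a particular prefix" concrete, here is a small sketch that groups file names from an index directory by segment prefix. The parsing rule (cut at the first `.` or at the second `_`) and the sample file names are assumptions for illustration, not Lucene's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: group index file names by segment prefix; names are hypothetical.
final class SegmentFilesSketch {
    /** Returns the segment prefix of a file name, or null for non-segment files. */
    static String segmentPrefix(String fileName) {
        if (!fileName.startsWith("_")) {
            return null; // e.g. segments_N or write.lock
        }
        int end = fileName.length();
        int dot = fileName.indexOf('.');
        if (dot >= 0) {
            end = Math.min(end, dot);
        }
        int underscore = fileName.indexOf('_', 1);
        if (underscore >= 0) {
            end = Math.min(end, underscore);
        }
        return fileName.substring(0, end);
    }

    /** Maps each segment prefix to the files that belong to it. */
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String prefix = segmentPrefix(name);
            if (prefix != null) {
                bySegment.computeIfAbsent(prefix, k -> new ArrayList<>()).add(name);
            }
        }
        return bySegment;
    }
}
```

Listing a directory at intervals and grouping it this way shows segments appearing and disappearing as merges run.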

Re: Getting a MaxBytesLengthExceededException for a TextField

2019-10-25 Thread Erick Erickson
Text-based fields indeed do not have that limit for the _entire_ field. They _do_ have that limit for any single token produced. So if your field contains, say, a base-64 encoded image that is not broken up into smaller tokens, you’ll still get this error. Best, Erick > On Oct 25, 2019, at
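
A quick pre-flight check along these lines can find the offending value before indexing. The sketch assumes a per-token limit of 32766 bytes (the value of `IndexWriter.MAX_TERM_LENGTH` in recent Lucene releases, not stated in the message) and a whitespace split standing in for the real analysis chain.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: find tokens whose UTF-8 form exceeds the assumed per-term limit.
final class TokenLengthSketch {
    // Assumed limit; check IndexWriter.MAX_TERM_LENGTH for the version in use.
    static final int MAX_TERM_BYTES = 32766;

    /** Splits on whitespace (a stand-in for the analyzer) and returns oversized tokens. */
    static List<String> oversizedTokens(String fieldValue) {
        List<String> tooLong = new ArrayList<>();
        for (String token : fieldValue.split("\\s+")) {
            if (token.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES) {
                tooLong.add(token);
            }
        }
        return tooLong;
    }
}
```

A long field made of many short words passes; a single unbroken blob such as base-64 data does not.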

Re: Classic QueryParser, StandardQueryParser, Quotes

2019-10-10 Thread Erick Erickson
1> Add &debug=query to the query and look at the parsed query returned. That’ll tell you a _lot_ about this kind of question. 2> look at the analysis page of the admin UI for the core and see how your field definition handles the tokens once they’re through <1>. Best, Erick > On Oct 10, 2019, at

Re: Beginner Question: Tokenized and full phrase

2019-09-02 Thread Erick Erickson
In the Lucene context you simply have tokens. In the analyzed case (i.e. text), the token is however the incoming stream is split up by the analysis chain you construct. In the string case the token is the entire input. That’s just the way it works. You have two choices: 1> Use two fields,
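
The difference between the two cases can be shown without Lucene at all. In this sketch the "analysis chain" is a made-up one (split on non-letter, non-digit characters, then lowercase); the real chain is whatever you configure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Sketch: tokens produced for a string field versus an analyzed text field.
final class FieldTokensSketch {
    /** String case: the entire input is the single token. */
    static List<String> stringFieldTokens(String input) {
        return Collections.singletonList(input);
    }

    /** Text case with a hypothetical chain: split on non-alphanumerics, lowercase. */
    static List<String> textFieldTokens(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

A search only matches what is actually in the token list, which is why a full-phrase match and a per-word match need different token streams.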

Re: find documents with big stored fields

2019-07-01 Thread Erick Erickson
Whoa. First, it should be pretty easy to figure out what fields are large, just look at your input documents. The fdt files are really simple, they’re just the compressed raw data. Numeric fields, for instance, are just character data in the fdt files. We usually see about a 2:1 ratio. There’s

Re: explainOther SOLR concept?

2019-06-27 Thread Erick Erickson
It’s a Solr-only param for adding to debug=true…. > On Jun 27, 2019, at 12:11 PM, baris.ka...@oracle.com wrote: > > Hi,- > > is explainOther a SOLR concept/parameter? > > i think i can only find it in SOLR docs but not pure Lucene docs. > > Best regards > > >

Re: how to find out each score contribution from booleanquery components

2019-06-27 Thread Erick Erickson
BTW, if you have the ID of the doc you _think_ should be returned you can see why it wasn’t by using the explainOther parameter. > On Jun 27, 2019, at 8:11 AM, András Péteri > wrote: > > Hi Baris, > > Explanation's output is hierarchical, and the leading "0.0" values you > are seeing are the

Re: Index Optimization

2019-06-25 Thread Erick Erickson
Optimize is rarely useful. It can give some performance gains, but is quite an expensive operation. Pre Solr 7.5, optimizing had some behaviors that weren’t obvious, see: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ Post 7.5, the behavior has changed.

Re: A possible Java exception message fix

2019-06-24 Thread Erick Erickson
What are you asking here? Indeed, Lucene 8 (and therefore Solr) will not open an index that has ever been touched by Lucene 6x or earlier. You must re-index into 8x. You cannot spoof this with, for instance, IndexUpgraderTool and go from 6->7->8. You must reindex from your system-of-record.

Re: Live index upgrading

2019-06-21 Thread Erick Erickson
ent id to retrieve the original >> document text. >> >> Is there a convenient place to store the getLiveDocs index across process >> interruptions? Or should I use something stupid like a file to store the >> counter? >> >> That is still a lot of hass

Re: Live index upgrading

2019-06-21 Thread Erick Erickson
6 index, concurrently build Lucene7 index from > scratch, user Lucene6 index for search. > - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7 > index for search. > > Rinse and repeat every major version. > > Really, isn't there something simpler

Re: Issue with lucene-core 3.4.0 and h2 database

2019-06-19 Thread Erick Erickson
Capitalization of the field name? > On Jun 19, 2019, at 8:40 AM, Robert Damian wrote: > > Hello, > > For a long time I've been using lucene as a search engine for my embedded > h2 databases. I am using lucene-core 3.4.0, which I know is pretty old and > h2 database 1.3.160. > > One issue that

Re: Live index upgrading

2019-06-17 Thread Erick Erickson
Let’s back up a bit. What version of Lucene are you using? Starting with Lucene 8, any index that’s ever been touched by Lucene 6 will not open. It does not matter if the index has been completely rewritten. It does not matter if it’s been run through IndexUpgraderTool, which just does a

Re: FuzzyQuery- why is it ignored?

2019-06-13 Thread Erick Erickson
Shot in the dark: stemming. Whenever I see a problem with something ending in “s” (or “er” or “ing” or….) my first suspect is that stemming is turned on. In that case the token in the index that’s actually searched on is somewhat different than you expect. The test is easy, just ensure your

Re: IntField to IntPoint

2019-06-07 Thread Erick Erickson
d by any query > (term, boolean, range, phrase, prefix)? > > Il giorno mer 5 giu 2019 alle ore 17:41 Erick Erickson < > erickerick...@gmail.com> ha scritto: > >> You cannot upgrade more than one major version, you must re-index from >> scratch. There’s a long di

Re: IntField to IntPoint

2019-06-06 Thread Erick Erickson
> On Jun 5, 2019, at 2:07 PM, Riccardo Tasso wrote: > > > Considering that the IndexUpgrader will efficiently do the most of the work > I should investigate how to fill this gap, without reindexing from scratch. > > This is actually a problem. IndexUpgraderTool creates a single massive

Re: IntField to IntPoint

2019-06-05 Thread Erick Erickson
You cannot upgrade more than one major version, you must re-index from scratch. There’s a long discussion of why, but basically it’s summed up by this quote from Robert Muir: “I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all

Re: About DuplicateFilter

2019-04-23 Thread Erick Erickson
How is the score being calculated? Because if it’s the usual scoring algorithm, there will be very few scores that are exactly identical. And the usual BM25 scores really don’t mean the documents are “similar”. This feels like an XY problem. How is “similarity” determined here? Best, Erick >

Re: Why does Lucene 7.4.0 commit() Increase Memory Usage x2

2019-04-02 Thread Erick Erickson
Task manager is almost useless for this kind of measurement. You never quite know how much garbage that hasn’t been collected is in that total. You can attach something like jconsole to the running Solr process and hit the “perform full GC” to get a more accurate number. Or you can look at

Re: Lucene migrate to 6.6.5 from 5.5.3

2019-03-28 Thread Erick Erickson
First I have to ask why not use something much more recent? 7.5 comes to mind. There’s not enough information here to say anything at all about what your problem might or might not be. “It doesn’t work” provides little to diagnose. You might want to review:

Re: Environmental Protection Agency: Stop Deforesting in Sri Lanka

2019-03-21 Thread Erick Erickson
This is an entirely inappropriate use of this list, do not do so again. > On Mar 21, 2019, at 12:06 AM, bjchathura...@gmail.com wrote: > > Hello there, > > I just signed the petition "Environmental Protection Agency: Stop > Deforesting in Sri Lanka" and wanted to see if you could help by adding

Re: Fetching 1000 documents taking around 30ms

2019-03-02 Thread Erick Erickson
“Is this expected” Yes. For each document, if there is any field with stored=true that does _not_ have docValues=true or is flagged as useDocValuesAsStored=false, there is 1> a disk seek to read the stored data from the fdt file 2> decompression of the data read in <1>, 16K block minimum. So

Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
. Thanks. > > > > c b search string finds > > a b > > but how cant find > > a de la b > > so i will try french stopwords. > > Doing that i am using 8 queries like the ones i mentioned. > > Best > > > >> On Feb 24, 2019, at 1:19 PM, Eri

Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
assumed. Best, Erick > On Feb 24, 2019, at 9:25 AM, baris.kazar wrote: > > i guess so > what is phrase search? > c b is searched do you expect a de la b? > Thanks > >> On Feb 24, 2019, at 10:49 AM, Erick Erickson wrote: >> >> Not sure we’re talking about the

Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
cases with c b ~ which means find all containing c And b and c > Or b ( two separate queries having ~ ) > and then i can find a b but not a de la b without French stopwords. > Thanks > >> On Feb 23, 2019, at 6:52 PM, Erick Erickson wrote: >> >> Lucene won’t ign

Re: Ignoring “de la” at index or search time

2019-02-23 Thread Erick Erickson
> Thanks Erick there is a pattern i can't catch in my results such as: > a de la b > i catch “a b” though. > I thought Lucene might ignore those automatically while creating index. > > >> On Feb 23, 2019, at 12:29 PM, Erick Erickson wrote: >> >> Use stopwords, although it's beco

Re: Ignoring “de la” at index or search time

2019-02-23 Thread Erick Erickson
Use stopwords, although it's becoming less of a concern, why do you think you need to? On Sat, Feb 23, 2019, 08:42 baris.kazar wrote: > Hi,- > What is the (most efficient) way to > ignore “de la” kinda connectors > in a string at index or search time? > Thanks > >

Re: Updating specific fields of huge docs

2019-02-13 Thread Erick Erickson
If (and only if) the fields you need to update are single-valued, docValues=true, indexed=false, you can do in-place update of the DV field only. Otherwise, you'll probably have to split the docs up. The question is whether you have evidence that reindexing is too expensive. If you do need to

Re: SynonymGraphFilter can't consume an incoming graph

2019-02-10 Thread Erick Erickson
It's, well, undefined. As in nobody knows except that it'll be wrong. And exactly what the results are may change with any given release. Best, Erick On Sun, Feb 10, 2019 at 10:48 AM lambda.coder lucene wrote: > > Hello, > > The Javadocs of SynonymGraphFilter says that it can’t consume an

Re: Manifoldcf2.10 - Sending user-defined fields to solr

2019-01-09 Thread Erick Erickson
You'd probably get more knowledgeable info from the Manifold folks, I don't know how many people on this list _also_ use Manifold... Best, Erick On Wed, Jan 9, 2019 at 5:48 AM subasini wrote: > > Hi > I am using manifoldcf 2.10 and Solr 7.6.0. > I can crawl my website and indexing done in Solr

Re: is Document match Query

2018-12-17 Thread Erick Erickson
I'm not sure I understand, but why not just fire the queries off with an fq of the document ID? If you just need to know if any of N queries match the doc, you could check several at once with a big OR clause. Best, Erick On Mon, Dec 17, 2018 at 5:06 AM Valentin Popov wrote: > > Hello. > > I
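
One way to read this suggestion, sketched as plain string building: OR the candidate queries together as the main query and restrict the search to the one document with a filter. The field name `id` and the helper names are assumptions; use whatever your unique key field is.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: build the main query and filter for "does any of N queries match doc X?".
final class MatchAnyQuerySketch {
    /** Wraps each candidate query in parentheses and ORs them together. */
    static String anyOf(List<String> queries) {
        return queries.stream()
            .map(q -> "(" + q + ")")
            .collect(Collectors.joining(" OR "));
    }

    /** Filter that restricts results to one document; "id" is an assumed field name. */
    static String onlyDocument(String docId) {
        String escaped = docId.replace("\\", "\\\\").replace("\"", "\\\"");
        return "id:\"" + escaped + "\"";
    }
}
```

The first string would go in `q` and the second in `fq`; a non-empty result means at least one of the queries matches that document.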

Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-09 Thread Erick Erickson
and have tests for) - I'll have one thread > write to the index, another (which starts after the first) search in it > and I'll create a bash script that runs the program until it fails (what > I currently do with our test). I'll do this beginning of next week. > > Thank you for

Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-09 Thread Erick Erickson
If it's hard to do in a single thread, how about timestamping the events to ensure that they happen in the expected order? That would verify the sequencing is happening as you expect and _still_ not see the expected docs... Assuming it does fail, do you think you could reduce it to a

Re: Static index, fastest way to do forceMerge

2018-11-03 Thread Erick Erickson
Do you really need exactly one segment? Or would, say, 5 be good enough? You see where this is going: set maxsegments to 5 and you may be able to get some parallelization... On Fri, Nov 2, 2018, 14:17 Dawid Weiss wrote: > Thanks for chipping in, Toke. A ~1TB index is impressive. > > Back of the envelope

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Erick Erickson
The merge process is rather tricky, and there's nothing that I know of that will use all resources available. In fact the merge code is written to _not_ use up all the possible resources on the theory that there should be some left over to handle queries etc. Yeah, the situation you describe is

Re: Lucene stops working

2018-11-02 Thread Erick Erickson
Is this custom code? What method? Can you show us a sample? There's not enough information here to say much. On Fri, Nov 2, 2018 at 7:38 AM egorlex wrote: > > Hi, I am new in Lucene and i have strange problem. Lucene stops working > without any errors after some time. It works fine for 1 day or

Re: Exception Details

2018-10-30 Thread Erick Erickson
No clue, org.compass.core is not part of Solr/Lucene, you'll have to ask the compass folks. Best, Erick On Tue, Oct 30, 2018 at 6:52 AM Veeraswami Pattapagalu wrote: > > Hi team, > > > We have got following exception while doing indexing , please share us the > information why we are getting

Re: Release the RAM

2018-10-25 Thread Erick Erickson
This really seems like an XY problem. What are you trying to accomplish that makes you want to use RAMDirectory at all? Why I'm asking: 1> RAMDirectory is quite special-purpose, very rarely is it something you should use 2> Java doesn't collect garbage when you close an object that references

Re: force deletes - terms enum still has deleted terms?

2018-09-28 Thread Erick Erickson
You might be hitting a rounding error. When this happens, how many deleted documents are there in the remaining segments? 1? The calculation for whether to merge the segment is: double pctDeletes = 100. * ((double) deleted_docs_in_segment / (double) doc_count_in_segment_including_deleted_docs if
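
The percentage calculation quoted above, written out as a runnable sketch. The comparison against a threshold is an assumption about how the truncated sentence continues, and the threshold value is supplied by the caller rather than taken from the message.

```java
// Sketch of the deleted-docs percentage quoted above; names are hypothetical.
final class PctDeletesSketch {
    /** Percentage of deleted documents in a segment, counting deleted docs in the total. */
    static double pctDeletes(int deletedDocs, int docCountIncludingDeleted) {
        return 100. * ((double) deletedDocs / (double) docCountIncludingDeleted);
    }

    /** Assumed follow-up check: is the segment over the caller's threshold? */
    static boolean exceeds(int deletedDocs, int docCountIncludingDeleted, double thresholdPct) {
        return pctDeletes(deletedDocs, docCountIncludingDeleted) > thresholdPct;
    }

    public static void main(String[] args) {
        // One deleted doc in a large segment gives a tiny percentage.
        System.out.println(pctDeletes(1, 1000));
    }
}
```

With a single deleted document in a large segment the percentage is very small, which is the situation the question about "1?" is probing.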

Re: Running query against a single document

2018-09-21 Thread Erick Erickson
bq. We would like to know if there is a way to test a query against a document without creating an index. We were thinking that maybe we could use lucene highlighter component to achieve this, I don't really understand this at all. How are you using the highlighter component without creating an

Re: How to access DocValues inside a customized collector?

2018-09-20 Thread Erick Erickson
What Luke are you using? I think this one is being maintained: https://github.com/DmitryKey/luke The Terms component directly accesses the indexed data and can be used to poke around in it. I'll skip the accessing DocValues as I have to go back and look every time. On Thu, Sep 20,

Re: MultiPhraseQuery

2018-09-18 Thread Erick Erickson
bq. i wish the Javadocs has examples like PhraseQuery Javadocs gave. This is where someone coming into the examples for the first time is invaluable, javadoc patches are most welcome! It can be hard to back off enough to remember what the confusing bits are when you wrote the code ;) On Tue, Sep

Re: Subscribe to lucene user list

2018-09-17 Thread Erick Erickson
http://lucene.apache.org/solr/community.html#mailing-lists-irc On Mon, Sep 17, 2018 at 6:12 AM Anupam Rastogi wrote: > > Hi, > >I would like to subscribe to Lucene user list. >Thanks for all the help. > > Thanks, > Anupam Rastogi

Re: Any way to improve document fetching performance?

2018-08-28 Thread Erick Erickson
hing performance is also > important. On Tue, 28 Aug 2018 00:11:40 +0800 Erick Erickson > wrote Don't use that call. You're exactly > right, it goes out to disk, reads the doc, decompresses it (16K blocks > minimum per doc IIUC) all just to get the field. 2,000 in 50ms act

Re: Any way to improve document fetching performance?

2018-08-27 Thread Erick Erickson
Don't use that call. You're exactly right, it goes out to disk, reads the doc, decompresses it (16K blocks minimum per doc IIUC) all just to get the field. 2,000 in 50ms actually isn't bad for all that work ;). This sounds like an XY problem. You're asking how to speed up fetching docs, but not

Re: Question about usage of LuceneTestCase

2018-08-22 Thread Erick Erickson
;> >> My understanding at this point is (though it may be a repeat of your >> words,) >> first we should find out the combinations behind the failures. >> If there are any particular patterns, there could be bugs, so we should >> fix it. >> >> Thanks, >

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
e test framework, > I'm not familiar with it and still do not understand what does "seed" means > exactly in this context. > > Regards, > Tomoko > > 2018年8月22日(水) 1:05 Erick Erickson : > >> Couple of things (and I know you've been around for a while, so pardon >> me

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
Couple of things (and I know you've been around for a while, so pardon me if it's all old hat to you): 1> if you run the entire "reproduce with" line and can get a consistent failure, then you are half way there, nothing is as frustrating as not getting failures reliably. The critical bit is

Re: Question about threading in search

2018-08-17 Thread Erick Erickson
Please don't optimize to 1 segment unless you can afford to do it quite regularly, see: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ (NOTE: this is changing as of 7.5, see: https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/). bq. It

Re: testing with system properties

2018-08-09 Thread Erick Erickson
See TestSolrXml.java for an example of: @Rule public TestRule solrTestRules = RuleChain.outerRule(new SystemPropertiesRestoreRule()); Best, Erick On Thu, Aug 9, 2018 at 2:33 PM, Michael Sokolov wrote: > I ran into a need to test some functionality that relies on system > properties. Writing
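
For readers without the test framework at hand, the idea behind a restore rule can be sketched in plain Java: remember the old value, run the code, and put things back even if the code throws. This is not the implementation of `SystemPropertiesRestoreRule`; the helper name and the property key in the usage note are made up.

```java
// Sketch: set a system property for the duration of a block, then restore it.
final class SystemPropertySketch {
    static void withSystemProperty(String key, String value, Runnable body) {
        String previous = System.getProperty(key);
        System.setProperty(key, value);
        try {
            body.run();
        } finally {
            // Restore the prior state, including "was not set at all".
            if (previous == null) {
                System.clearProperty(key);
            } else {
                System.setProperty(key, previous);
            }
        }
    }
}
```

A call such as `withSystemProperty("my.test.flag", "on", () -> runCodeUnderTest())` (both names hypothetical) leaves no trace of the property once it returns.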

Re: lucene index file gets corrupted while creating index with 2 nodes.

2018-07-31 Thread Erick Erickson
There is no chance anyone will try to change the code for 3.6, so raising a JIRA is pointless. see: http://lucene.472066.n3.nabble.com/Issues-with-locked-indices-td4339180.html Uwe is very knowledgeable in this area, so I'd strongly recommend you follow his advice. Best, Erick On Tue, Jul 31,

Re: WildcardQuery question

2018-07-23 Thread Erick Erickson
nks Eric, > > I see only Solr documents in there. My solution is 100% Lucene. > > Regards, > > Evert > > On Mon, Jul 23, 2018 at 7:56 PM Erick Erickson > wrote: > >> Take a look at ReversedWildcardFilterFactory: >> >> https://lucene.apache.org/solr/guid

Re: WildcardQuery question

2018-07-23 Thread Erick Erickson
Take a look at ReversedWildcardFilterFactory: https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html Best, Erick On Mon, Jul 23, 2018 at 7:53 AM, Evert Wagenaar wrote: > Hello all, > > I have a WebApp (see http://ejwagenaar.com/index.php/Lingoweb/) which makes > extensive use of

Re: Lucene stringField EnglishAnalyzer search not working

2018-07-20 Thread Erick Erickson
Why so complicated? Boosts do you no good, you're only trying to find one document. Boosts influence the score of documents in the ranking, but there's only one. I suspect if you looked at the debug form of the parsed query, you'd find it pretty unexpected. You say it works with text fields, but

Re: Lucene stringField EnglishAnalyzer search not working

2018-07-20 Thread Erick Erickson
Please provide specific examples of what you mean, along with the fieldType you tried, an example of the input at index time for the field, and examples of what searches "didn't work". What exactly did you expect to happen that didn't? You might review:

Re: Grant Ingersoll's 2009 blog article- is there a newer version?

2018-07-05 Thread Erick Erickson
Maybe look at the Solr payload code to see how to do it in Lucene? But yeah, that article is quite out of date. On Thu, Jul 5, 2018 at 8:23 AM, wrote: > Thanks i saw these posts but Grant's article is based on Lucene. > > i am not using Solr. Many classes in that article does not exist in

Re: Size of Document

2018-07-04 Thread Erick Erickson
rested in the source data > document size, just real disk usage). > Thanks > > Chris > > Sent from my iPhone > >> On 4 Jul 2018, at 17:08, Erick Erickson wrote: >> >> But does size on disk help? If the doc has a zillion >> images in it, those aren't part of

Re: Size of Document

2018-07-04 Thread Erick Erickson
But does size on disk help? If the doc has a zillion images in it, those aren't part of the resulting index (I'm excluding stored data here) On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen wrote: > In the document types I usually index (.pdf, .docx/.doc, .eml), there > exists a metadata field

Re: lucene hadoop index

2018-06-11 Thread Erick Erickson
I think you're far more likely to find people who know the details on the Hadoop mailing list On Mon, Jun 11, 2018 at 2:13 AM, Yonghui Zhao wrote: > I found there was > "org.apache.hadoop.contrib.index.lucene.FileSystemDirectory" for lucene in > hadoop old version. > >

Re: Deletions in NRTCachingDirectory

2018-05-14 Thread Erick Erickson
" I have a use case of bulk deletions with solr and want to understand using soft commits will help or not." Will help with what? You haven't told us what the problem you're worried about is. This might help:

Re: SortingMergePolicy is removed in 7.2.1?

2018-04-10 Thread Erick Erickson
I found it in .../solr/core/src/java/org/apache/solr/index/SortingMergePolicy.java and the associated factory in .../solr/core/src/java/org/apache/solr/index/SortingMergePolicyFactory.java so I'm not sure what you're having trouble with. Best, Erick On Tue, Apr 10, 2018 at 4:56 AM, Yonghui

Re: Storage of indexed and stored fields (Space and Performance)

2018-03-15 Thread Erick Erickson
Stored data is kept in separate segment files (*.fdt and *.fdx). As such they have no measurable impact on query time. All the data for executing searches is kept in other extensions in each segment and accessed separately. Adding stored data does increase the size on disk by roughly 50% of the

Re: getting Lucene Docid from inside score()

2018-03-10 Thread Erick Erickson
; System.out.println(hits[0].doc); // I want this docid > inside score() > >> If you still want to get the internal ID, just specify the >> pseudo-field [docid], as: "fl=id,[docid]" > > I didn't get your suggestion properly. Can you please explain a

Re: getting Lucene Docid from inside score()

2018-03-09 Thread Erick Erickson
You almost certainly do _not_ want this unless you are absolutely and totally sure that your index does not change between the time you ask for the internal Lucene doc ID and the time you use it. No docs may be added. No forceMerges are done. In fact, I'd go so far as to say you shouldn't open

Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-06 Thread Erick Erickson
nderstand your thinking that if the doc values are not persisted with doc > id sequence, it is unable to retrieve field value by doc id. > > Actually, I am just wondering how lucene handle the sorting scenario, is > iterating all values of all docs unavoidable? > > > On 3/6/1

Re: Is docvalue sorted by value?

2018-03-05 Thread Erick Erickson
I think there are two issues here that are being conflated 1> _within_ a document, i.e. for a multi-valued field the values are stored as Dominik says as a SORTED_SET. Not only will they be returned (if you return from docValues rather than stored) in lexical order, but identical values will be
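
The within-document behaviour described here resembles a sorted set, which a few lines of plain Java can imitate. This is an analogy only: it assumes duplicates are collapsed (the sentence above is cut off before saying so) and uses Java's string ordering, which agrees with byte order for ASCII values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: what a sorted-set view does to one document's multi-valued field.
final class SortedSetValuesSketch {
    /** Returns the values in sorted order with duplicates removed. */
    static List<String> asSortedSet(List<String> valuesInInputOrder) {
        return new ArrayList<>(new TreeSet<>(valuesInInputOrder));
    }
}
```

The original input order and any repeated values are not recoverable from this view, which is the practical difference from reading the stored field.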

Re: Custom Similarity

2018-02-08 Thread Erick Erickson
As of Solr 6.6, payload support has been added to Solr, see: SOLR-1485. Before that, it was much more difficult, see: https://lucidworks.com/2014/06/13/end-to-end-payload-example-in-solr/ Best, Erick On Thu, Feb 8, 2018 at 8:36 AM, Ahmet Arslan wrote: > > > Hi Roy, >

Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Erick Erickson
Robert: Ah, right. I keep confusing my gmail lists "lucene dev" and "lucene list". Siiigh. On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand wrote: > If you have sparse data, I would have expected index time to *decrease*, > not increase. > > Can you enable the IW

Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Erick Erickson
My first question is always "are you running the Solr CPUs flat out?". My guess in this case is that the indexing client is the same and the problem is in Solr, but it's worth checking whether the clients are just somehow not delivering docs as fast as they were before. My suspicion is that the

Re: Maven snapshots

2018-01-09 Thread Erick Erickson
Maven support is not officially part of the project, it's maintained on a "when someone interested gets to it" basis. So the short answer is "no, you shouldn't expect those to be absolutely current". Contributions welcome ;) Best, Erick On Tue, Jan 9, 2018 at 6:36 AM, Armins Stepanjans

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Erick Erickson
Luke has some capabilities to look at the index at a low level, perhaps that could give you some pointers. I think you can pull the older branch from here: https://github.com/DmitryKey/luke or: https://code.google.com/archive/p/luke/ NOTE: This is not a part of Lucene, but an independent project

Re: solr 7.0: What causes the segment to flush

2017-10-17 Thread Erick Erickson
bq: Is there a way to not write to disk continuously and only write the file... Not if we're talking about the transaction log. The design is for the transaction log in particular to continuously get updates flushed to it, otherwise you could not replay the transaction log upon restart and have

Re: run in eclipse error

2017-10-17 Thread Erick Erickson
Anyone can raise a JIRA and submit a patch, it's then up to one of the committers to pick it up and commit to the code lines. You have to create an ID of course. See: https://issues.apache.org/jira/ On Tue, Oct 17, 2017 at 5:04 AM, Mike Sokolov wrote: > Checkstyle has a

Re: run in eclipse error

2017-10-16 Thread Erick Erickson
bq: Does git master need to use java9 for development I can at least answer that with "no". Java 8 is the current standard for master. No clue what's going on with Eclipse though, I use IntelliJ. That class is part of Solr so Java 9 is probably not germane. Best, Erick On Mon, Oct 16, 2017

Re: Custom Query & reading plongs used by a custom Scorer

2017-10-06 Thread Erick Erickson
docValues are the first thing I'd look at. What you've done is an anti-pattern for scoring because it reads the stored data from disk and decompresses it to read the value; as you say, costly. Getting it from a docValues field, OTOH, will read the value(s) directly from MMapDirectory space, i.e. the

Re: solr7.0.1: TestControlledRealTimeReopenThread stalled forever

2017-10-03 Thread Erick Erickson
Whew! Thanks for letting us know. Erick On Tue, Oct 3, 2017 at 1:12 PM, Nawab Zada Asad Iqbal wrote: > Actually, it seems that one of my local changes is causing the halting > issue. I am debugging it now. Sorry for noise. > > On Tue, Oct 3, 2017 at 12:08 PM, Nawab Zada Asad

Re: Still using lucene 2.3, is compatible with java 8?

2017-09-16 Thread Erick Erickson
I doubt anyone has tested it. I'd compile it under Java 8 and see if all of the tests run. Best, Erick On Sat, Sep 16, 2017 at 7:41 AM, Lisheng Zhang wrote: > Hi, in one of our product we are still using lucene 2.3, is lucene 2.3 > compatible with java 1.8? > > Thanks very

Re: Need to unsub from lucene groups.

2017-09-10 Thread Erick Erickson
See: http://lucene.apache.org/solr/community.html, the "unsubscribe" section. If you have problems, look at the "Problems" link. Note, you _must_ use the exact same e-mail you originally subscribed with. Best, Erick On Sun, Sep 10, 2017 at 8:51 AM, Khurram Shehzad

Re: Encryption at lucene index

2017-08-11 Thread Erick Erickson
e fields of an document which has personal >> > identifiable information ( both indexed and stored data)... for eg: >> email, >> > mobilenumber etc.. i am able to find LUCENE-6966 alone while googling >> it.. >> > any related pointers in solr or latest lucene version? >

Re: Encryption at lucene index

2017-08-07 Thread Erick Erickson
M, Kumaran Ramasubramanian <kums@gmail.com> wrote: > Hi Erick, > > Thanks for the information. Any pointers about encryption options in > solr? > > > -- > Kumaran R > > > > On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <erickerick...@gmail.com> > wrote:

Re: Encryption at lucene index

2017-08-07 Thread Erick Erickson
Encryption in Solr has a bunch of ramifications. Do you care about - encryption at rest or in memory? - encrypting the _searchable_ tokens? - encrypting the searchable tokens per-user? - encrypting the stored data (which a filter won't do BTW). It's actually a fairly complex topic the discussion

Re: Reply: local variable name question

2017-08-06 Thread Erick Erickson
Usually there's no very good reason; it's just that a bunch of people with more or less time, more or less pressure, and different habits choose variable names that reflect how they're thinking about the issue at the time. Generally when working on a bit of code if the names are confusing

Re: Lucene 6.6: "Too many open files"

2017-07-31 Thread Erick Erickson
No, nothing's changed fundamentally. But you say: "We have some batch indexing scripts, which flood the solr servers with indexing requests (while keeping open-searcher false)" What is your commit interval? Regardless of whether openSearcher is false or not, background merging continues apace

Re: Maintaining sorting order (stored fields vs DocValue fields) while upgrading Lucene version

2017-06-29 Thread Erick Erickson
1> Is it correct that stored fields can only be sorted on if they become a DocValue field in 5.x? No. Indexed-only fields can still be used to sort. DocValues are just more efficient at load time and don't consume as much of the Java heap. Essentially this latter can be thought of as moving the
