Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-18 Thread Michael McCandless
Congratulations and thank you, Jan! It is so exciting that Solr is now a TLP! Mike McCandless http://blog.mikemccandless.com On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta wrote: > Hi everyone, > > I’d like to inform everyone that the newly formed Apache Solr PMC > nominated and elected Jan

Re: ExecutorService support in SolrIndexSearcher

2019-08-31 Thread Michael McCandless
We pass ExecutorService to Lucene's IndexSearcher at Amazon (for customer facing product search) and it's a big win on long-pole query latencies, but hurts red-line QPS (cluster capacity) a bit, due to less efficient collection across segments and thread context switching. I'm surprised it's not

Re: Mistake assert tips in FST builder ?

2019-04-22 Thread Michael McCandless
Hello, Indeed, your cosmetic fix looks great -- I'll push that change. Thanks for noticing and raising! Mike McCandless http://blog.mikemccandless.com On Tue, Apr 16, 2019 at 12:04 AM zhenyuan wei wrote: > Hi, >With current newest version, 9.0.0-snapshot,In >

Re: Long blocking during indexing + deleteByQuery

2017-11-08 Thread Michael McCandless
I'm not sure this is what's affecting you, but you might try upgrading to Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple threads to resolve deletions: http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html Mike McCandless

Re: SOLR-11504: Provide a config to restrict number of indexing threads

2017-11-02 Thread Michael McCandless
Actually, it's one lucene segment per *concurrent* indexing thread. So if you have 10 indexing threads in Lucene at once, then 10 in-memory segments will be created and will have to be written on refresh/commit. Elasticsearch uses a bounded thread pool to service all indexing requests, which I
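The point about bounded thread pools can be illustrated with a toy model. This is a sketch only, not Lucene or Elasticsearch code: `index_documents` and its "segments" dictionary are hypothetical names, and each pool worker simply stands in for one concurrent indexing thread. It shows that capping the pool size caps the number of in-memory segments that would have to be written on refresh/commit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def index_documents(docs, max_threads):
    """Toy model: every distinct worker thread fills its own in-memory 'segment'."""
    segments = {}
    lock = threading.Lock()

    def index_one(doc):
        # The worker thread's name identifies the segment this document lands in.
        name = threading.current_thread().name
        with lock:
            segments.setdefault(name, []).append(doc)

    # A bounded pool: at most max_threads workers ever index concurrently.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        list(pool.map(index_one, docs))
    return segments


segments = index_documents([f"doc-{i}" for i in range(100)], max_threads=4)
print(len(segments), "in-memory segments for 100 documents")
```

However many documents arrive, the number of segments in this model never exceeds `max_threads`.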

Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes

2017-05-29 Thread Michael McCandless
>> I am investigating the question, if this change is still needed in 6.5.1 >> or can this be achieved by any other configuration? >> >> For now, we are not planning to use NRT and solrCloud. >> >> >> Thanks >> Nawab >> >> On Sun, May 28, 20

Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes

2017-05-28 Thread Michael McCandless
Sorry, yes, that commit was one of many on a feature branch I used to work on LUCENE-5438, which added near-real-time index replication to Lucene. Before this change, Lucene's replication module required a commit in order to replicate, which is a heavy operation. The writeAllDeletes boolean

Re: AnalyzingInfixSuggester performance

2017-04-18 Thread Michael McCandless
zingInfixSuggester instead of a regular Solr index (since both are > using standard Lucene?) is that the AInfixSuggester does sorting at > index-time using the weightField? So it's only ever advantageous to use > this Suggester if you need sorting based on a field? > > Thanks >

Re: AnalyzingInfixSuggester performance

2017-04-18 Thread Michael McCandless
AnalyzingInfixSuggester uses index-time sort, to sort all postings by the suggest weight, so that lookup, as long as you sort by the suggest weight, is extremely fast. But if you need to rank at lookup time by something not "congruent" with the index-time sort then you lose that benefit. Mike
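Why an index-time sort makes lookup fast can be sketched in a few lines. This is a pure-Python toy, not the AnalyzingInfixSuggester implementation; `build_sorted_postings` and `lookup` are hypothetical names. Entries are sorted by weight once, so a lookup that ranks by that same weight can stop as soon as it has collected k matches.

```python
def build_sorted_postings(suggestions):
    """Index time: sort (text, weight) entries by descending weight, once."""
    return sorted(suggestions, key=lambda entry: entry[1], reverse=True)


def lookup(postings, infix, k):
    """Lookup time: scan in index order and stop after the first k matches."""
    hits = []
    scanned = 0
    for text, weight in postings:
        scanned += 1
        if infix in text:
            hits.append((text, weight))
            if len(hits) == k:
                break  # every remaining entry has an equal or lower weight
    return hits, scanned


postings = build_sorted_postings([
    ("solr join", 3), ("lucene join module", 9), ("join scoring", 7),
    ("merge policy", 8), ("block join query", 5),
])
top, scanned = lookup(postings, "join", k=2)
print(top, "after scanning", scanned, "of", len(postings), "entries")
```

If the ranking at lookup time used some other criterion, the early `break` would no longer be valid and every entry would have to be scanned and sorted.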

Re: Is there a way to tell if multivalued field actually contains multiple values?

2016-11-11 Thread Michael McCandless
I think you can use the term stats that Lucene tracks for each field. Compare Terms.getSumTotalTermFreq and Terms.getDocCount. If they are equal it means every document that had this field had only one token. Mike McCandless http://blog.mikemccandless.com On Fri, Nov 11, 2016 at 5:50 AM,
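The comparison of the two statistics can be modelled outside Lucene. The sketch below is a toy over in-memory dictionaries, not a call into the Lucene API; `field_stats` and `is_effectively_single_valued` are hypothetical helpers, and it assumes each value of the field is indexed as exactly one token.

```python
def field_stats(docs, field):
    """Return (sum of token counts, number of docs having the field) for one field.

    docs is a list of dicts mapping a field name to its list of tokens.
    """
    sum_total_term_freq = 0
    doc_count = 0
    for doc in docs:
        tokens = doc.get(field, [])
        if tokens:
            doc_count += 1
            sum_total_term_freq += len(tokens)
    return sum_total_term_freq, doc_count


def is_effectively_single_valued(docs, field):
    """True when no document carries more than one token in the field."""
    total, count = field_stats(docs, field)
    return total == count


docs = [{"tag": ["a"]}, {"tag": ["b"]}, {"title": ["x"]}]
print(is_effectively_single_valued(docs, "tag"))
docs.append({"tag": ["a", "c"]})
print(is_effectively_single_valued(docs, "tag"))
```

The check never looks at individual documents afterwards; it relies only on the two aggregate counts, which is what makes it cheap.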

[ANNOUNCE] Apache Solr 6.2.0 released

2016-08-26 Thread Michael McCandless
26 August 2016, Apache Solr 6.2.0 available Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search,

Re: ConcurrentMergeScheduler options not exposed

2016-06-17 Thread Michael McCandless
awn Heisey <apa...@elyograg.org> wrote: > On 6/16/2016 2:35 AM, Michael McCandless wrote: > > > > Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for > > very long ... unless there is a huge percentage of deletes. Also, by > > default CMS does

Re: ConcurrentMergeScheduler options not exposed

2016-06-16 Thread Michael McCandless
Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for very long ... unless there is a huge percentage of deletes. Also, by default CMS doesn't throttle forced merges (see CMS.get/setForceMergeMBPerSec). Maybe capture IndexWriter.setInfoStream output? Mike McCandless

Re: Lucene/Solr Git Mirrors 5 day lag behind SVN?

2015-10-24 Thread Michael McCandless
I added a comment on the INFRA issue. I don't understand why it periodically "gets stuck". Mike McCandless http://blog.mikemccandless.com On Fri, Oct 23, 2015 at 11:27 AM, Kevin Risden wrote: > It looks like both Apache Git mirror

Re: CheckIndex failed for Solr 4.7.2 index

2015-06-09 Thread Michael McCandless
IBM's J9 JVM unfortunately still has a number of nasty bugs affecting Lucene; most likely you are hitting one of these. We used to test J9 in our continuous Jenkins jobs, but there were just too many J9-specific failures and we couldn't get IBM's attention to resolve them, so we stopped. For now

[ANNOUNCE] Apache Solr 4.10.4 released

2015-03-05 Thread Michael McCandless
March 2015, Apache Solr™ 4.10.4 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted

Re: Frequent deletions

2015-01-01 Thread Michael McCandless
Also see this G+ post I wrote up recently showing how the % deletions changes over time for an "every add also deletes a previous document" stress test: https://plus.google.com/112759599082866346694/posts/MJVueTznYnD Mike McCandless http://blog.mikemccandless.com On Wed, Dec 31, 2014 at 12:21 PM,

Re: DocsEnum and TermsEnum reuse in lucene join library?

2014-12-06 Thread Michael McCandless
They should be reused if the impl. allows for it. Besides reducing GC cost, it can also be a sizable performance gain since these enums can have quite a bit of state that otherwise must be re-initialized. If you really don't want to reuse them (force a new enum every time), pass null. Mike

[ANNOUNCE] Apache Solr 4.10.2 released

2014-10-31 Thread Michael McCandless
October 2014, Apache Solr™ 4.10.2 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted

[ANNOUNCE] Apache Solr 4.10.1 released

2014-09-29 Thread Michael McCandless
September 2014, Apache Solr™ 4.10.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting,

[ANNOUNCE] Apache Solr 4.9.1 released

2014-09-22 Thread Michael McCandless
September 2014, Apache Solr™ 4.9.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.9.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted

Re: [ANNOUNCE] Apache Solr 4.9.1 released

2014-09-22 Thread Michael McCandless
release that is critical to the RM (Michael McCandless) and/or an organization where he has influence or liability. Apparently this was a more expedient path than completely validating a 4.10 upgrade and waiting for the 4.10.1 bugfix release. Validating the 4.10 upgrade probably would have taken

Re: optimize and .nfsXXXX files

2014-08-18 Thread Michael McCandless
Soft commit (i.e. opening a new IndexReader in Lucene and closing the old one) should make those go away? The .nfsX files are created when a file is deleted but a local process (in this case, the current Lucene IndexReader) still has the file open. Mike McCandless
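The underlying file-system behaviour can be reproduced on a local POSIX file system. This is a minimal sketch, assuming Linux or another POSIX system (on Windows the delete would fail instead); `read_after_delete` is a hypothetical name and the open file handle merely stands in for the open IndexReader. On NFS the client cannot keep an unlinked file alive this way, so it renames it to a `.nfsXXXX` file until the last handle is closed.

```python
import os
import tempfile


def read_after_delete():
    """Delete a file while it is still open, then read from the open handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("segment data")

    reader = open(path)                  # stands in for the open IndexReader
    os.remove(path)                      # the merge deletes the old segment file
    still_listed = os.path.exists(path)  # the name is gone from the directory
    data = reader.read()                 # but the open handle still works
    reader.close()                       # only now can the space be reclaimed
    return still_listed, data


print(read_after_delete())
```

Closing the handle is the step that corresponds to closing the old IndexReader in the message above.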

Re: Does lucene uses tries?

2014-06-05 Thread Michael McCandless
The default terms dictionary (BlockTree) also uses a trie index structure to locate the block on disk that may contain a target term. Mike McCandless http://blog.mikemccandless.com On Thu, Jun 5, 2014 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote: I just have want know that does the

Re: Expected date of release for Solr 4.7.1

2014-03-29 Thread Michael McCandless
RC2 is being voted on now ... so it should be soon (a few days, but more if any new blocker issues are found and we need to do RC3). Mike McCandless http://blog.mikemccandless.com On Sat, Mar 29, 2014 at 2:26 PM, Puneet Pawaia puneet.paw...@gmail.com wrote: Hi Any idea on the expected date

Re: Enabling other SimpleText formats besides postings

2014-03-28 Thread Michael McCandless
You told the fieldType to use SimpleText only for the postings, not all other parts of the codec (doc values, live docs, stored fields, etc...), and so it used the default codec for those components. If instead you used the SimpleTextCodec (not sure how to specify this in Solr's schema.xml) then

Re: AutoSuggest like Google in Solr using Solarium Client.

2014-03-17 Thread Michael McCandless
I think it's best to use one of the many autosuggesters Lucene/Solr provide? E.g. AnalyzingInfixSuggester is running here: http://jirasearch.mikemccandless.com But that's just one suggester... there are many more. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 17, 2014 at 10:44

Re: Join Scoring

2014-02-13 Thread Michael McCandless
I suspect (not certain) one reason for the performance difference with Solr vs Lucene joins is that Solr operates on a top-level reader? This results in fast joins, but it means whenever you open a new reader (NRT reader) there is a high cost to regenerate the top-level data structures. But if

Re: Lucene Join

2014-01-30 Thread Michael McCandless
Look in lucene's join module? Mike McCandless http://blog.mikemccandless.com On Thu, Jan 30, 2014 at 4:15 AM, anand chandak anand.chan...@oracle.com wrote: Hi, I am trying to find whether the lucene joins (not solr join) if they are using any filter cache. The API that lucene uses is for

Re: background merge hit exception while optimizing index (SOLR 4.4.0)

2014-01-13 Thread Michael McCandless
Which version of Java are you using? That root cause exception is somewhat spooky: it's in the ByteBufferIndexInput code that handles an UnderflowException, ie when a small (maybe a few hundred bytes) read happens to span the 1 GB page boundary, and specifically the exception happens on the final read

Re: background merge hit exception while optimizing index (SOLR 4.4.0)

2014-01-13 Thread Michael McCandless
I have trouble understanding J9's version strings ... but, is it really from 2008? You could be hitting a JVM bug; can you test upgrading? I don't have much experience with Solr faceting on optimized vs unoptimized indices; maybe someone else can answer your question. Lucene's facet module (not

Re: MergePolicy for append-only indices?

2014-01-08 Thread Michael McCandless
On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: I think the key optimization when there are no deletions is that you don't need to renumber documents and can bulk-copy blocks of contiguous documents, and that is independent of merge policy. I think :)

Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-31 Thread Michael McCandless
On Mon, Dec 30, 2013 at 1:22 PM, Greg Preston gpres...@marinsoftware.com wrote: That was it. Setting omitNorms=true on all fields fixed my problem. I left it indexing all weekend, and heap usage still looks great. Good! I'm still not clear why bouncing the solr instance freed up memory,

Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-27 Thread Michael McCandless
Likely this is for field norms, which use doc values under the hood. Mike McCandless http://blog.mikemccandless.com On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston gpres...@marinsoftware.com wrote: Does anybody with knowledge of solr internals know why I'm seeing instances of

Re: Problems with gaps removed with SynonymFilter

2013-09-23 Thread Michael McCandless
Unfortunately the current SynonymFilter cannot handle posInc != 1 ... we could perhaps try to fix this ... patches welcome :) So for now it's best to place SynonymFilter before StopFilter, and before any other filters that may create graph tokens (posLen > 1, posInc == 0). Mike McCandless

Re: Why solr 4.0 use FSIndexOutput to write file, otherwise MMap/NIO

2013-06-28 Thread Michael McCandless
Output is quite a bit simpler than input because all we do is write a single stream of bytes with no seeking (append only), and it's done with only one thread, so I don't think there'd be much to gain by using the newer IO APIs for writing... Mike McCandless http://blog.mikemccandless.com On

Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
The default is 2.0, and higher values will more strongly favor merging segments with deletes. I think 20.0 is likely way too high ... maybe try 3-5? Mike McCandless http://blog.mikemccandless.com On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: Hi

Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
19, 2013 at 1:36 PM, Petersen, Robert robert.peter...@mail.rakuten.com wrote: OK thanks, will do. Just out of curiosity, what would having that set way too high do? Would the index become fragmented or what? -Original Message- From: Michael McCandless [mailto:luc

Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael McCandless
You could also try the new[ish] PostingsHighlighter: http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html Mike McCandless http://blog.mikemccandless.com On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: If you have very large

Re: How to recover from Error opening new searcher when machine crashed while indexing

2013-05-01 Thread Michael McCandless
Alas I think CheckIndex can't do much here: there is no segments file, so you'll have to reindex from scratch. Just to check: did you ever call commit while building the index before the machine crashed? Mike McCandless http://blog.mikemccandless.com On Tue, Apr 30, 2013 at 8:17 PM, Otis

Re: Bloom filters and optimized vs. unoptimized indices

2013-04-30 Thread Michael McCandless
Be sure to test the bloom postings format on your own use case ... in my tests (heavy PK lookups) it was slower. But to answer your question: I would expect a single segment index to have much faster PK lookups than a multi-segment one, with and without the bloom postings format, but bloom may

Re: Document adds, deletes, and commits ... a question about visibility.

2013-04-15 Thread Michael McCandless
At the Lucene level, you don't have to commit before doing the deleteByQuery, i.e. 'a' will be correctly deleted without any intervening commit. Mike McCandless http://blog.mikemccandless.com On Mon, Apr 15, 2013 at 3:57 PM, Shawn Heisey s...@elyograg.org wrote: Simple question first: Is there

Re: Is Lucene's DrillSideways something suitable for Solr?

2013-03-13 Thread Michael McCandless
On Tue, Mar 12, 2013 at 11:24 PM, Yonik Seeley yo...@lucidworks.com wrote: On Tue, Mar 12, 2013 at 10:27 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Lucene seems to get a new DrillSideways functionality on top of its own facet implementation. I would love to have something like that

Re: AW: 170G index, 1.5 billion documents, out of memory on query

2013-02-26 Thread Michael McCandless
It really should be unlimited: this setting has nothing to do with how much RAM is on the computer. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Mike McCandless http://blog.mikemccandless.com On Tue, Feb 26, 2013 at 12:18 PM, zqzuk ziqizh...@hotmail.co.uk wrote:

Re: AnalyzingSuggester returning index value instead of field value?

2013-02-07 Thread Michael McCandless
I'm not very familiar with how AnalyzingSuggester works inside Solr ... if you try this directly with the Lucene APIs does it still happen? Hmm maybe one idea: if you remove whitespace from your suggestion does it work? I wonder if there's a whitespace / multi-token issue ... if so then maybe

Re: get a list of terms sorted by total term frequency

2012-11-07 Thread Michael McCandless
Lucene's misc module has a HighFreqTerms tool. Mike McCandless http://blog.mikemccandless.com On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett heacu.mcint...@gmail.com wrote: hi, is there a simple way to get a list of all terms that occur in a field sorted by their total term frequency within

Re: throttle segment merging

2012-10-29 Thread Michael McCandless
With Lucene 4.0, FSDirectory now supports merge bytes/sec throttling (FSDirectory.setMaxMergeWriteMBPerSec): it rate limits the max bytes/sec load on the IO system due to merging. Not sure if it's been exposed in Solr / ElasticSearch yet ... Mike McCandless http://blog.mikemccandless.com On

Re: Indexing in Solr: invalid UTF-8

2012-09-26 Thread Michael McCandless
Python's unicode function takes an optional (keyword) errors argument, telling it what to do when an invalid UTF8 byte sequence is seen. The default (errors='strict') is to throw the exceptions you're seeing. But you can also pass errors='replace' or errors='ignore'. See
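The three error modes can be compared side by side. The message refers to Python 2's `unicode` function; the sketch below uses the Python 3 equivalent, `bytes.decode`, which accepts the same `errors` values. The sample bytes and the `decode_all_ways` helper are made up for illustration.

```python
raw = b"abc\xffdef"  # the byte 0xFF can never appear in valid UTF-8


def decode_all_ways(data):
    """Decode the same bytes with each of the three error-handling modes."""
    results = {}
    try:
        results["strict"] = data.decode("utf-8")  # the default is errors='strict'
    except UnicodeDecodeError as exc:
        results["strict"] = f"error at byte {exc.start}"
    # 'replace' substitutes U+FFFD for the bad byte; 'ignore' drops it silently.
    results["replace"] = data.decode("utf-8", errors="replace")
    results["ignore"] = data.decode("utf-8", errors="ignore")
    return results


results = decode_all_ways(raw)
for mode, value in results.items():
    print(mode, repr(value))
```

Using `replace` keeps a visible marker where data was damaged, which is usually easier to audit after indexing than the silent loss `ignore` produces.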

Re: offsets issues with multiword synonyms since LUCENE_33

2012-08-14 Thread Michael McCandless
See also SOLR-3390. Some cases have been addressed. Eg, if you match domain name system -> dns, then dns will have correct offsets spanning the full phrase domain name system in the input. (However: QueryParser won't work because a query for domain name system is pre-split on whitespace so the

Re: synonym file

2012-08-03 Thread Michael McCandless
Actually FST (and SynFilter based on it) was backported to 3.x. Mike McCandless http://blog.mikemccandless.com On Fri, Aug 3, 2012 at 11:28 AM, Jack Krupansky j...@basetechnology.com wrote: The Lucene FST guys made a big improvement in synonym filtering in Lucene/Solr 4.0 using FSTs. Or are

Re: Near Real Time Indexing and Searching with solr 3.6

2012-07-03 Thread Michael McCandless
Hi, You might want to take a look at Solr's trunk (very soon to be 4.0.0 alpha release), which already has a near-real-time solution (using Lucene's near-real-time APIs). Lucene has NRTCachingDirectory (to use RAM for small / recently flushed segments), but I don't think Solr uses it yet. Mike

Re: leap second bug

2012-07-01 Thread Michael McCandless
Looks like this is a low-level Linux issue ... see Shay's email to the ElasticSearch list about it: https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/_I1_OfaL7QY Also see the comments here: http://news.ycombinator.com/item?id=4182642 Mike McCandless

Re: Exception when optimizing index

2012-06-18 Thread Michael McCandless
Is it possible the Linux machine has bad RAM / bad disk? Mike McCandless http://blog.mikemccandless.com On Mon, Jun 18, 2012 at 7:06 AM, Erick Erickson erickerick...@gmail.com wrote: Is it possible that you somehow have some problem with jars and classpath? I'm wondering because this problem

Re: field name was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
This behavior has changed. In 3.x, you silently got no results in such cases. In trunk, you get an exception notifying you that the query cannot run. Mike McCandless http://blog.mikemccandless.com On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, What is

Re: field name was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
I believe termPositions=false refers to the term vectors and not how the field is indexed (which is very confusing I think...). I think you'll need to index a separate field, with term freqs + positions disabled, from the field the queryparser can query? But ... if all of this is to just do custom

Re: question about NRT(soft commit) and Transaction Log in trunk

2012-05-06 Thread Michael McCandless
This is a good question... I don't know much about how Solr's transaction log works, but, peeking in the code, I do see it fsync'ing (look in TransactionLog.java, in the finish method), but only if the SyncLevel is FSYNC. If the default is really flush, I don't see how the transaction log helps

Re: SOLR 3.5 Index Optimization not producing single .cfs file

2012-05-03 Thread Michael McCandless
By default, the default merge policy (TieredMergePolicy) won't create the CFS if the segment is very large (> 10% of the total index size). Likely that's what you are seeing? If you really must have a CFS (how come?) then you can call TieredMergePolicy.setNOCFSRatio(1.0) -- not sure how/where
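The size rule described here can be written down as a one-line predicate. This is a toy restatement of the rule as the message describes it, not Lucene's code; `uses_compound_file` is a hypothetical name, the 0.1 default mirrors the 10% figure above, and the exact comparison Lucene performs is an assumption.

```python
def uses_compound_file(segment_bytes, total_index_bytes, no_cfs_ratio=0.1):
    """Toy rule: pack a segment into a CFS only if it is at most
    no_cfs_ratio of the total index size."""
    return segment_bytes <= no_cfs_ratio * total_index_bytes


GB = 1024 ** 3
print(uses_compound_file(5 * GB, 100 * GB))    # small segment
print(uses_compound_file(100 * GB, 100 * GB))  # the single segment left by an optimize
print(uses_compound_file(100 * GB, 100 * GB, no_cfs_ratio=1.0))
```

After an optimize the one remaining segment is the whole index, so under this rule it only becomes a compound file once the ratio is raised to 1.0.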

Re: Large Index and OutOfMemoryError: Map failed

2012-04-22 Thread Michael McCandless
(Final) KERNEL NAME: 2.6.18-128.el5 UPTIME: up 71 days LOAD AVERAGE: 1.42, 1.45, 1.53 JBOSS Version: Implementation-Version: 4.2.2.GA (build: SVNTag=JBoss_4_2_2_GA date=20 JAVA Version: java version 1.6.0_24 On Thu, Apr 12, 2012 at 3:07 AM, Michael McCandless luc

Re: Large Index and OutOfMemoryError: Map failed

2012-04-12 Thread Michael McCandless
] -- From: Michael McCandless luc...@mikemccandless.com Date: Sat, Mar 31, 2012 at 3:15 AM To: solr-user@lucene.apache.org It's the virtual memory limit that matters; yours says unlimited below (good!), but, are you certain that's really the limit your Solr process runs with? On Linux

Re: codecs for sorted indexes

2012-04-12 Thread Michael McCandless
Do you mean you are pre-sorting the documents (by what criteria?) yourself, before adding them to the index? In which case... you should already be seeing some benefits (smaller index size) compared to had you randomly added them (ie the vInts should take fewer bytes), I think. (Probably the savings

Re: Large Index and OutOfMemoryError: Map failed

2012-04-11 Thread Michael McCandless
, Michael McCandless luc...@mikemccandless.com wrote: It's the virtual memory limit that matters; yours says unlimited below (good!), but, are you certain that's really the limit your Solr process runs with? On Linux, there is also a per-process map count:    cat /proc/sys/vm/max_map_count

Re: Virtual Memory very high

2012-04-02 Thread Michael McCandless
Are you seeing a real problem here, besides just being alarmed by the big numbers from top? Consumption of virtual memory by itself is basically harmless, as long as you're not running up against any of the OS limits (and, you're running a 64 bit JVM). This is just top telling you that you've

Re: Open deleted index file failing jboss shutdown with Too many open files Error

2012-04-02 Thread Michael McCandless
Hmm, unless the ulimits are low, or the default mergeFactor was changed, or you have many indexes open in a single JVM, or you keep too many IndexReaders open, even in an NRT or frequent commit use case, you should not run out of file descriptors. Frequent commit/reopen should be perfectly fine,

Re: Large Index and OutOfMemoryError: Map failed

2012-03-31 Thread Michael McCandless
It's the virtual memory limit that matters; yours says unlimited below (good!), but, are you certain that's really the limit your Solr process runs with? On Linux, there is also a per-process map count: cat /proc/sys/vm/max_map_count I think it typically defaults to 65,536 but you should
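The same check can be scripted. This is a small sketch, assuming a Linux host where the `/proc` path named above exists; `read_max_map_count` is a hypothetical helper that returns `None` anywhere the file is missing or unreadable.

```python
from pathlib import Path


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the per-process map-count limit as an int, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no permission, or unexpected file contents.
        return None


limit = read_max_map_count()
if limit is None:
    print("max_map_count not available on this system")
else:
    print("per-process map count limit:", limit)
```

Run it as the same user, and ideally from the same environment, as the Solr process, since that is the process whose limits matter.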

Re: effect of continuous deletes on index's read performance

2012-02-06 Thread Michael McCandless
On Mon, Feb 6, 2012 at 8:20 AM, prasenjit mukherjee prasen@gmail.com wrote: Pardon my ignorance, Why can't the IndexWriter and IndexSearcher share the same underlying in-memory datastructure so that IndexSearcher need not be reopened with every commit. Because the semantics of an

Re: LUCENE-995 in 3.x

2012-01-05 Thread Michael McCandless
Thank you Ingo! I think post the 3.x patch directly on the issue? I'm not sure why this wasn't backported to 3.x the first time around... Mike McCandless http://blog.mikemccandless.com On Thu, Jan 5, 2012 at 8:15 AM, Ingo Renner i...@typo3.org wrote: Hi all, I've backported LUCENE-995 to

Re: LUCENE-995 in 3.x

2012-01-05 Thread Michael McCandless
Awesome, thanks Ingo... I'll have a look! Mike McCandless http://blog.mikemccandless.com On Thu, Jan 5, 2012 at 9:23 AM, Ingo Renner i...@typo3.org wrote: Am 05.01.2012 um 15:05 schrieb Michael McCandless: Thank you Ingo! I think post the 3.x patch directly on the issue? thanks

Re: help no segment in my lucene index!!!

2011-11-28 Thread Michael McCandless
Which version of Solr/Lucene were you using when you hit power loss? There was a known bug that could allow power loss to cause corruption, but this was fixed in Lucene 3.4.0. Unfortunately, there is no easy way to recreate the segments_N file... in principle it should be possible and maybe not

Re: help no segment in my lucene index!!!

2011-11-28 Thread Michael McCandless
On Mon, Nov 28, 2011 at 10:49 AM, Roberto Iannone iann...@crmpa.unisa.it wrote: Hi Michael, thx for your help :) You're welcome! 2011/11/28 Michael McCandless luc...@mikemccandless.com Which version of Solr/Lucene were you using when you hit power loss? I'm using Lucene 3.4. Hmm, which

Re: Parent-child options

2011-11-08 Thread Michael McCandless
Lucene itself has BlockJoinQuery/Collector (in contrib/join), which is what ElasticSearch is using under the hood for its nested documents (I think?). But I don't think this has been exposed in Solr yet; patches welcome! Mike McCandless http://blog.mikemccandless.com On Tue, Nov 8, 2011 at

Re: large scale indexing issues / single threaded bottleneck

2011-10-29 Thread Michael McCandless
On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer simon.willna...@googlemail.com wrote: one more thing, after somebody (thanks robert) pointed me at the stacktrace it seems kind of obvious what the root cause of your problem is. Its solr :) Solr closes the IndexWriter on commit which is very

Re: How to make UnInvertedField faster?

2011-10-22 Thread Michael McCandless
On Sat, Oct 22, 2011 at 4:10 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Fri, Oct 21, 2011 at 4:37 PM, Michael McCandless luc...@mikemccandless.com wrote: Well... the limitation of DocValues is that it cannot handle more than one value per document (which UnInvertedField can

Re: How to make UnInvertedField faster?

2011-10-21 Thread Michael McCandless
Well... the limitation of DocValues is that it cannot handle more than one value per document (which UnInvertedField can). Hopefully we can fix that at some point :) Mike McCandless http://blog.mikemccandless.com On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer simon.willna...@googlemail.com

Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Can you attach this PDF to an email sent to the list? Or is it too large for that? Or, you can try running Tika directly on the PDF to see if it's able to extract the text. Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo hecto...@gmail.com: Sorry you have the

Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Hmm, no attachment; maybe it's too large? Can you send it directly to me? Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo hecto...@gmail.com: This is the file that give me errors. 2011/10/5 Michael McCandless luc...@mikemccandless.com Can you attach this PDF

Re: Query failing because of omitTermFreqAndPositions

2011-10-04 Thread Michael McCandless
if omitPositions is made false. Thanks, Isan Fulia. On 29 September 2011 17:49, Michael McCandless luc...@mikemccandless.com wrote: Once a given field has omitted positions in the past, even for just one document, it sticks and that field will forever omit positions. Try creating a new index

Re: Query failing because of omitTermFreqAndPositions

2011-09-29 Thread Michael McCandless
Once a given field has omitted positions in the past, even for just one document, it sticks and that field will forever omit positions. Try creating a new index, never omitting positions from that field? Mike McCandless http://blog.mikemccandless.com On Thu, Sep 29, 2011 at 1:14 AM, Isan Fulia

Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-22 Thread Michael McCandless
On Wed, Sep 21, 2011 at 10:10 PM, Michael Sokolov soko...@ifactory.com wrote: I wonder if config-file validation would be helpful here :) I posted a patch in SOLR-1758 once. Big +1. We should aim for as stringent config file checking as possible. Mike McCandless

Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
Are you sure you are using a 64 bit JVM? Are you sure you really changed your vmem limit to unlimited? That should have resolved the OOME from mmap. Or: can you run cat /proc/sys/vm/max_map_count? This is a limit on the total number of maps in a single process, that Linux imposes. But the

Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
and now it looks like everything works as expected. Need some further testing with the java versions, but I'm quite optimistic. Best regards Ralf Am 22.09.2011 14:46, schrieb Michael McCandless: Are you sure you are using a 64 bit JVM? Are you sure you really changed your vmem limit

Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
        soft    nofile  49151 Thanks, Shawn On 9/22/2011 9:56 AM, Michael McCandless wrote: OK, excellent.  Thanks for bringing closure, Mike McCandless http://blog.mikemccandless.com On Thu, Sep 22, 2011 at 9:00 AM, Ralf Matulatralf.matu...@bundestag.de  wrote: Dear Mike, thanks for your

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-20 Thread Michael McCandless
Since you hit OOME during mmap, I think this is an OS issue not a JVM issue. Ie, the JVM isn't running out of memory. How many segments were in the unoptimized index? It's possible the OS rejected the mmap because of process limits. Run cat /proc/sys/vm/max_map_count to see how many mmaps are

[ANNOUNCE] Apache Solr 3.4.0 released

2011-09-14 Thread Michael McCandless
September 14 2011, Apache Solr™ 3.4.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.4.0. Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit

Re: Nested documents

2011-09-11 Thread Michael McCandless
Even if it applies, this is for Lucene. I don't think we've added Solr support for this yet... we should! Mike McCandless http://blog.mikemccandless.com On Sun, Sep 11, 2011 at 12:16 PM, Erick Erickson erickerick...@gmail.com wrote: Does this JIRA apply?

Re: What will happen when one thread is closing a searcher while another is searching?

2011-09-06 Thread Michael McCandless
Closing a searcher while thread(s) is/are still using it is definitely bad, so, this code looks spooky... But: is it possible something higher up (in Solr) is ensuring this code runs exclusively? I don't know enough about this part of Solr... Mike McCandless http://blog.mikemccandless.com On

heads up: re-index trunk Lucene/Solr indices

2011-08-20 Thread Michael McCandless
Hi, I just committed a new block tree terms dictionary implementation, which requires fully re-indexing any trunk indices. See here for details: https://issues.apache.org/jira/browse/LUCENE-3030 If you are using a released version of Lucene/Solr then you can ignore this message. Mike

Re: Solr Join in 3.3.x

2011-08-18 Thread Michael McCandless
Unfortunately Solr's join impl hasn't been backported to 3.x, as far as I know. You might want to look at ElasticSearch, which already has a join implementation, or use Solr 4.0. Mike McCandless http://blog.mikemccandless.com On Wed, Aug 17, 2011 at 7:40 PM, Cameron Hurst wakemaste...@z33k.com

Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
This file is actually optional; it's there for redundancy in case the filesystem is not reliable when listing a directory. Ie, normally, we list the directory to find the latest segments_N file; but if this is wrong (eg the file system might have a stale cache) then we fall back to reading the
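
The directory-listing step described above can be sketched in plain Java. This is only an illustration, not Lucene's actual code: the class and method names are hypothetical, and it assumes the generation suffix N in segments_N is written in base 36 (Character.MAX_RADIX). It takes a list of file names, ignores everything that is not a segments_N file (including segments.gen), and returns the highest generation found. The fallback to segments.gen is not modeled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: picks the newest commit
// generation from a plain directory listing.
final class LatestSegments {
    static final String PREFIX = "segments_";

    // Returns the highest generation among names like "segments_N",
    // or -1 if the listing contains no such file.
    // Assumption: N is encoded in radix 36.
    static long latestGeneration(List<String> fileNames) {
        long max = -1;
        for (String name : fileNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // e.g. "segments.gen" or "_0.cfs"
            }
            try {
                long gen = Long.parseLong(name.substring(PREFIX.length()), Character.MAX_RADIX);
                max = Math.max(max, gen);
            } catch (NumberFormatException e) {
                // Suffix is not a base-36 number; not a commit file, skip it.
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("_0.cfs", "segments.gen", "segments_1", "segments_a");
        System.out.println(latestGeneration(listing));
    }
}
```

If the listing is stale, this returns an older generation than the one actually committed, which is the failure case the redundant file is meant to cover.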

Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
I think we should fix replication to copy it? Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2011 at 8:16 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: On 04.08.2011 12:52, Michael McCandless wrote: This file is actually optional; it's there for redundancy in case

Re: Field collapsing on multiple fields and/or ranges?

2011-07-06 Thread Michael McCandless
I believe the underlying grouping module is now technically able to do this, because subclasses of the abstract first/second pass grouping collectors are free to decide what type/value the group key is. But, we have to fix Solr to allow for compound keys by creating the necessary concrete

Re: Cannot I search documents added by IndexWriter after commit?

2011-07-05 Thread Michael McCandless
After your writer.commit you need to reopen your searcher to see the changes. Mike McCandless http://blog.mikemccandless.com On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote:    @Test    public void testUpdate() throws IOException, ParserConfigurationException,

Re: Cannot I search documents added by IndexWriter after commit?

2011-07-05 Thread Michael McCandless
On Tue, Jul 5, 2011 at 8:09 PM, Michael McCandless luc...@mikemccandless.com wrote: After your writer.commit you need to reopen your searcher to see the changes. Mike McCandless http://blog.mikemccandless.com On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote

Re: Fuzzy Query Param

2011-06-30 Thread Michael McCandless
Good question... I think in Lucene 4.0, the edit distance is (will be) in Unicode code points, but in past releases, it's UTF16 code units. Mike McCandless http://blog.mikemccandless.com 2011/6/30 Floyd Wu floyd...@gmail.com: if this is edit distance implementation, what is the result apply to
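
To make the code point versus UTF-16 code unit distinction concrete, here is a small self-contained Java sketch. It is not Lucene's fuzzy matching code; it is an ordinary Levenshtein distance (class and method names are made up for illustration) run once over code points and once over UTF-16 code units. The two results differ as soon as a term contains a character outside the Basic Multilingual Plane, which Java stores as a surrogate pair.

```java
// Illustrative only: plain Levenshtein distance, not Lucene's automaton-based
// fuzzy matching. Shows how the unit of comparison changes the result.
final class EditDistanceUnits {

    // Classic two-row dynamic-programming Levenshtein over int sequences.
    static int levenshtein(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Distance measured in Unicode code points.
    static int byCodePoints(String s, String t) {
        return levenshtein(s.codePoints().toArray(), t.codePoints().toArray());
    }

    // Distance measured in UTF-16 code units (Java chars).
    static int byUtf16Units(String s, String t) {
        return levenshtein(s.chars().toArray(), t.chars().toArray());
    }

    public static void main(String[] args) {
        String a = "a\uD83D\uDE00"; // "a" followed by U+1F600, one code point, two chars
        String b = "ab";
        System.out.println(byCodePoints(a, b)); // one substitution
        System.out.println(byUtf16Units(a, b)); // one substitution plus one deletion
    }
}
```

For terms made only of BMP characters the two measures agree, so the difference matters mainly for supplementary characters.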

Re: Fuzzy Query Param

2011-06-29 Thread Michael McCandless
Which version of Solr (Lucene) are you using? Recent versions of Lucene now accept ~N > 1 to be edit distance. Ie foobar~2 matches any term that's <= 2 edit distance away from foobar. Mike McCandless http://blog.mikemccandless.com On Tue, Jun 28, 2011 at 11:00 PM, entdeveloper

Re: Optimize taking two steps and extra disk space

2011-06-21 Thread Michael McCandless
/2011 3:18 PM, Michael McCandless wrote: With segmentsPerTier at 35 you will easily cross 70 segs in the index... If you want optimize to run in a single merge, I would lower segmentsPerTier and mergeAtOnce (maybe back to the 10 default), and set your maxMergeAtOnceExplicit to 70 or higher

Re: Optimize taking two steps and extra disk space

2011-06-21 Thread Michael McCandless
On Tue, Jun 21, 2011 at 9:42 AM, Shawn Heisey s...@elyograg.org wrote: On 6/20/2011 12:31 PM, Michael McCandless wrote: For back-compat, mergeFactor maps to both of these, but it's better to set them directly eg:     mergePolicy class=org.apache.lucene.index.TieredMergePolicy       int name

Re: Optimize taking two steps and extra disk space

2011-06-20 Thread Michael McCandless
On Sun, Jun 19, 2011 at 12:35 PM, Shawn Heisey s...@elyograg.org wrote: On 6/19/2011 7:32 AM, Michael McCandless wrote: With LogXMergePolicy (the default before 3.2), optimize respects mergeFactor, so it's doing 2 steps because you have 37 segments but 35 mergeFactor. With TieredMergePolicy

Re: Optimize taking two steps and extra disk space

2011-06-20 Thread Michael McCandless
On Mon, Jun 20, 2011 at 4:00 PM, Shawn Heisey s...@elyograg.org wrote: On 6/20/2011 12:31 PM, Michael McCandless wrote: Actually, TieredMP has two different params (different from the previous default LogMP):   * segmentsPerTier controls how many segments you can tolerate in the index

Re: Optimize taking two steps and extra disk space

2011-06-19 Thread Michael McCandless
With LogXMergePolicy (the default before 3.2), optimize respects mergeFactor, so it's doing 2 steps because you have 37 segments but 35 mergeFactor. With TieredMergePolicy (default on 3.2 and after), there is now a separate merge factor used for optimize (maxMergeAtOnceExplicit)... so you could
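
The "2 steps for 37 segments at mergeFactor 35" arithmetic can be checked with a rough Java model. This is a simplification and not the real merge policy code: it assumes each optimize pass merges at most mergeFactor segments into one and ignores segment sizes and concurrent merges. The class and method names are hypothetical.

```java
// Simplified model of optimize under a fixed merge factor; the real merge
// policies also weigh segment sizes, which this sketch ignores.
final class OptimizePasses {

    // Counts merge passes needed to reach one segment when each pass
    // combines at most mergeFactor segments into a single new segment.
    static int passes(int segments, int mergeFactor) {
        if (segments < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need segments >= 1 and mergeFactor >= 2");
        }
        int steps = 0;
        while (segments > 1) {
            int merged = Math.min(segments, mergeFactor);
            segments = segments - merged + 1; // merged inputs become one output
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(passes(37, 35)); // 37 -> 3 -> 1
        System.out.println(passes(37, 70)); // all 37 fit in one merge
    }
}
```

Under this model, any limit of at least 37 lets the optimize finish in a single merge, which is why raising the optimize-time limit (maxMergeAtOnceExplicit for TieredMergePolicy) avoids the second pass.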

Re: Field Collapsing and Grouping in Solr 3.2

2011-06-16 Thread Michael McCandless
Alas, no, not yet.. grouping/field collapse has had a long history with Solr. There were many iterations on SOLR-236, but that impl was never committed. Instead, SOLR-1682 was committed, but committed only to trunk (never backported to 3.x despite requests). Then, a new grouping module was
