Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-18 Thread Michael McCandless
Congratulations and thank you, Jan!  It is so exciting that Solr is now a
TLP!

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta  wrote:

> Hi everyone,
>
> I’d like to inform everyone that the newly formed Apache Solr PMC
> nominated and elected Jan Høydahl for the position of the Solr PMC Chair
> and Vice President. This decision was approved by the board in its February
> 2021 meeting.
>
> Congratulations Jan!
>
> --
> Anshum Gupta
>


Re: ExecutorService support in SolrIndexSearcher

2019-08-31 Thread Michael McCandless
We pass ExecutorService to Lucene's IndexSearcher at Amazon (for customer
facing product search) and it's a big win on long-pole query latencies, but
hurts red-line QPS (cluster capacity) a bit, due to less efficient
collection across segments and thread context switching.

I'm surprised it's not an option for Solr and Elasticsearch ... for certain
applications it's a huge win.

And yes as David points out -- Collectors (CollectorManagers) need to
support this "gather results for each segment separately then reduce in the
end" mode...

Mike McCandless

http://blog.mikemccandless.com


On Fri, Aug 30, 2019 at 4:45 PM David Smiley 
wrote:

> It'd take some work to do that.  Years ago I recall Etsy did a POC and
> shared their experience at Lucene/Solr Revolution in Washington DC; I
> attended the presentation with great interest.  One of the major obstacles,
> if I recall, was the Collector needs to support this mode of operation, and
> in particular Solr's means of flipping bits in a big bitset to accumulate
> the DocSet had to be careful so that multiple threads don't try to
> overwrite the same underlying "long" in the long[].
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 26, 2019 at 7:02 AM Aghasi Ghazaryan
>  wrote:
>
> > Hi,
> >
> > Lucene's IndexSearcher
> > <
> >
> http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/IndexSearcher.html#IndexSearcher-org.apache.lucene.index.IndexReaderContext-java.util.concurrent.ExecutorService-
> > >
> > supports
> > running searches for each segment separately, using the provided
> > ExecutorService.
> > I wonder why SolrIndexSearcher does not support the same, as it may
> > improve query performance a lot?
> >
> > Thanks, looking forward to hearing from you.
> >
> > Regards
> > Aghasi Ghazaryan
> >
>


Re: Mistake assert tips in FST builder ?

2019-04-22 Thread Michael McCandless
Hello,

Indeed, your cosmetic fix looks great -- I'll push that change. Thanks for
noticing and raising!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 16, 2019 at 12:04 AM zhenyuan wei  wrote:

> Hi,
>    With the current newest version (9.0.0-snapshot), in
> Builder.UnCompiledNode.addArc(), I found this line:
>
> assert numArcs == 0 || label > arcs[numArcs-1].label: "arc[-1].label="
> + arcs[numArcs-1].label + " new label=" + label + " numArcs=" +
> numArcs;
>
> Maybe the assert message should be:
>
> assert numArcs == 0 || label > arcs[numArcs-1].label:
> "arc[numArc-1].label=" + arcs[numArcs-1].label + " new label=" + label
> + " numArcs=" + numArcs;
>
> Is it an intentional code style, or a small mistake?
>
> Just curious about it.
>


Re: Long blocking during indexing + deleteByQuery

2017-11-08 Thread Michael McCandless
I'm not sure this is what's affecting you, but you might try upgrading to
Lucene/Solr 7.1; in 7.0 there were big improvements in using multiple
threads to resolve deletions:
http://blog.mikemccandless.com/2017/07/lucene-gets-concurrent-deletes-and.html

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 7, 2017 at 2:26 PM, Chris Troullis  wrote:

> @Erick, I see, thanks for the clarification.
>
> @Shawn, Good idea for the workaround! I will try that and see if it
> resolves the issue.
>
> Thanks,
>
> Chris
>
> On Tue, Nov 7, 2017 at 1:09 PM, Erick Erickson 
> wrote:
>
> > bq: you think it is caused by the DBQ deleting a document while a
> > document with that same ID
> >
> > No. I'm saying that DBQ has no idea _if_ that would be the case so
> > can't carry out the operations in parallel because it _might_ be the
> > case.
> >
> > Shawn:
> >
> > IIUC, here's the problem. For deleteById, I can guarantee the
> > sequencing through the same optimistic locking that regular updates
> > use (i.e. the _version_ field). But I'm kind of guessing here.
> >
> > Best,
> > Erick
> >
> > On Tue, Nov 7, 2017 at 8:51 AM, Shawn Heisey 
> wrote:
> > > On 11/5/2017 12:20 PM, Chris Troullis wrote:
> > >> The issue I am seeing is when some
> > >> threads are adding/updating documents while other threads are issuing
> > >> deletes (using deleteByQuery), solr seems to get into a state of
> extreme
> > >> blocking on the replica
> > >
> > > The deleteByQuery operation cannot coexist very well with other
> indexing
> > > operations.  Let me tell you about something I discovered.  I think
> your
> > > problem is very similar.
> > >
> > > Solr 4.0 and later is supposed to be able to handle indexing operations
> > > at the same time that the index is being optimized (in Lucene,
> > > forceMerge).  I have some indexes that take about two hours to
> optimize,
> > > so having indexing stop while that happens is a less than ideal
> > > situation.  Ongoing indexing is similar in many ways to a merge, enough
> > > that it is handled by the same Merge Scheduler that handles an
> optimize.
> > >
> > > I could indeed add documents to the index without issues at the same
> > > time as an optimize, but when I would try my full indexing cycle while
> > > an optimize was underway, I found that all operations stopped until the
> > > optimize finished.
> > >
> > > Ultimately what was determined (I think it was Yonik that figured it
> > > out) was that *most* indexing operations can happen during the
> optimize,
> > > *except* for deleteByQuery.  The deleteById operation works just fine.
> > >
> > > I do not understand the low-level reasons for this, but apparently it's
> > > not something that can be easily fixed.
> > >
> > > A workaround is to send the query you plan to use with deleteByQuery as
> > > a standard query with a limited fl parameter, to retrieve matching
> > > uniqueKey values from the index, then do a deleteById with that list of
> > > ID values instead.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
>


Re: SOLR-11504: Provide a config to restrict number of indexing threads

2017-11-02 Thread Michael McCandless
Actually, it's one lucene segment per *concurrent* indexing thread.

So if you have 10 indexing threads in Lucene at once, then 10 in-memory
segments will be created and will have to be written on refresh/commit.

Elasticsearch uses a bounded thread pool to service all indexing requests,
which I think is a healthy approach.  It shouldn't have to be the client's
job to worry about server side details like this.
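
A minimal sketch of that idea with plain Lucene (illustrative only, not
Solr's actual code path; assumes an existing IndexWriter named writer and a
collection of Documents named docs):

    ExecutorService indexingPool = Executors.newFixedThreadPool(8); // at most 8 concurrent indexing threads
    for (Document doc : docs) {
      indexingPool.submit(() -> {
        writer.addDocument(doc);   // so at most 8 in-memory segments (DWPTs) are active at once
        return null;
      });
    }
    indexingPool.shutdown();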

Mike McCandless

http://blog.mikemccandless.com

On Thu, Nov 2, 2017 at 5:23 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Nawab,
>
> > One indexing thread in lucene  corresponds to one segment being written.
> I need a fine control on the number of segments.
>
> I didn’t check the code, but I would be surprised if that is how things
> work. It can appear that it is working like that if each client thread is
> doing commits. Is that the case?
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 1 Nov 2017, at 18:00, Nawab Zada Asad Iqbal  wrote:
> >
> > Well, the reason I want to control the number of indexing threads is to
> > restrict the number of "segments" being created at one time in RAM. One
> > indexing thread in Lucene corresponds to one segment being written. I need
> > fine control over the number of segments. Fewer than that, and I will not
> > be fully utilizing my writing capacity. On the other hand, if I have more
> > threads, then I will end up with a lot more segments of small size, which I
> > will need to flush frequently and then merge, and that will cause a
> > different kind of problem.
> >
> > Your suggestion will require me and other such Solr users to create a
> > tight coupling between the clients and the Solr servers. My client is not
> > SolrJ based. In a scenario where I am connecting and indexing to Solr
> > remotely, I want more requests to be waiting on the Solr side so that they
> > start writing as soon as an indexing thread is available, vs. waiting on my
> > client side - on the other side of the wire.
> >
> > Thanks
> > Nawab
> >
> > On Wed, Nov 1, 2017 at 7:11 AM, Shawn Heisey 
> wrote:
> >
> >> On 10/31/2017 4:57 PM, Nawab Zada Asad Iqbal wrote:
> >>
> >>> I hit this issue https://issues.apache.org/jira/browse/SOLR-11504
> while
> >>> migrating to solr6 and locally working around it in Lucene code. I am
> >>> thinking to fix it properly and hopefully patch back to Solr. Since,
> >>> Lucene
> >>> code does not want to keep any such config, I am thinking to use a
> >>> counting
> >>> semaphore in Solr code before calling IndexWriter.addDocument(s) or
> >>> IndexWriter.updateDocument(s).
> >>>
> >>
> >> There's a fairly simple way to control the number of indexing threads
> that
> >> doesn't require ANY changes to Solr:  Don't start as many
> threads/processes
> >> on your indexing client(s).  If you control the number of simultaneous
> >> requests sent to Solr, then Solr won't start as many indexing threads.
> >> That kind of control over your indexing system is something that's
> always
> >> preferable to have.
> >>
> >> Thanks,
> >> Shawn
> >>
>
>


Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes

2017-05-29 Thread Michael McCandless
If you are not using NRT readers then the applyAllDeletes/writeAllDeletes
boolean values are completely unused (and should have no impact on your
performance).

Mike McCandless

http://blog.mikemccandless.com

On Sun, May 28, 2017 at 8:34 PM, Nawab Zada Asad Iqbal <khi...@gmail.com>
wrote:

> After reading some more code, it seems that if we are sure that there are no
> deletes in this segment/index, then setting applyAllDeletes and
> writeAllDeletes both to false will achieve something similar to what I was getting in
> 4.5.0
>
> However, after I read the comment from IndexWriter::DirectoryReader
> getReader(boolean applyAllDeletes, boolean writeAllDeletes) , it seems that
> this method is particular to NRT.  Since we are not using soft commits, can
> this change actually improve our performance during full reindex?
>
>
> Thanks
> Nawab
>
>
>
>
>
>
>
>
>
> On Sun, May 28, 2017 at 2:16 PM, Nawab Zada Asad Iqbal <khi...@gmail.com>
> wrote:
>
>> Thanks Michael and Shawn for the detailed response. I was later able to
>> pull the full history using gitk; and found the commits behind this patch.
>>
>> Mike:
>>
>> So, in solr 4.5.0 ; some earlier developer has added code and config to
>> set applyAllDeletes to false when we reindex all the data.  At the moment,
>> I am not sure about the performance gain by this.
>>
>> 
>>
>>
>> I am investigating the question, if this change is still needed in 6.5.1
>> or can this be achieved by any other configuration?
>>
>> For now, we are not planning to use NRT and solrCloud.
>>
>>
>> Thanks
>> Nawab
>>
>> On Sun, May 28, 2017 at 9:26 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> Sorry, yes, that commit was one of many on a feature branch I used to
>>> work on LUCENE-5438, which added near-real-time index replication to
>>> Lucene.  Before this change, Lucene's replication module required a commit
>>> in order to replicate, which is a heavy operation.
>>>
>>> The writeAllDeletes boolean option asks Lucene to move all recent
>>> deletes (tombstone bitsets) to disk while opening the NRT (near-real-time)
>>> reader.
>>>
>>> Normally Lucene won't always do that, and will instead carry the bitsets
>>> in memory from writer to reader, for reduced refresh latency.
>>>
>>> What sort of custom changes do you have in this part of Lucene?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sat, May 27, 2017 at 10:35 PM, Nawab Zada Asad Iqbal <
>>> khi...@gmail.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I am looking at the following change in lucene-solr which doesn't mention any
>>>> JIRA. How can I know more about it?
>>>>
>>>> "1ae7291 Mike McCandless on 1/24/16 at 3:17 PM current patch"
>>>>
>>>> Specifically, I am interested in what 'writeAllDeletes'  does in the
>>>> following method. Let me know if it is a very stupid question and I should
>>>> have done something else before emailing here.
>>>>
>>>> static DirectoryReader open(IndexWriter writer, SegmentInfos infos,
>>>> boolean applyAllDeletes, boolean writeAllDeletes) throws IOException {
>>>>
>>>> Background: We are running solr4.5 and upgrading to 6.5.1. We have
>>>> some custom code in this area, which we need to merge.
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Nawab
>>>>
>>>
>>>
>>
>


Re: StandardDirectoryReader.java:: applyAllDeletes, writeAllDeletes

2017-05-28 Thread Michael McCandless
Sorry, yes, that commit was one of many on a feature branch I used to work
on LUCENE-5438, which added near-real-time index replication to Lucene.
Before this change, Lucene's replication module required a commit in order
to replicate, which is a heavy operation.

The writeAllDeletes boolean option asks Lucene to move all recent deletes
(tombstone bitsets) to disk while opening the NRT (near-real-time) reader.

Normally Lucene won't always do that, and will instead carry the bitsets in
memory from writer to reader, for reduced refresh latency.
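
For reference, the NRT open this applies to looks roughly like this (a
sketch; the exact signature depends on the Lucene version, and writer is an
existing IndexWriter):

    DirectoryReader nrtReader = DirectoryReader.open(writer,
        true,    // applyAllDeletes: resolve buffered deletes so the reader sees them
        false);  // writeAllDeletes: keep recent tombstone bitsets in RAM for lower refresh latency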

What sort of custom changes do you have in this part of Lucene?

Mike McCandless

http://blog.mikemccandless.com

On Sat, May 27, 2017 at 10:35 PM, Nawab Zada Asad Iqbal 
wrote:

> Hi all
>
> I am looking at the following change in lucene-solr which doesn't mention any
> JIRA. How can I know more about it?
>
> "1ae7291 Mike McCandless on 1/24/16 at 3:17 PM current patch"
>
> Specifically, I am interested in what 'writeAllDeletes'  does in the
> following method. Let me know if it is a very stupid question and I should
> have done something else before emailing here.
>
> static DirectoryReader open(IndexWriter writer, SegmentInfos infos,
> boolean applyAllDeletes, boolean writeAllDeletes) throws IOException {
>
> Background: We are running solr4.5 and upgrading to 6.5.1. We have
> some custom code in this area, which we need to merge.
>
>
> Thanks
>
> Nawab
>


Re: AnalyzingInfixSuggester performance

2017-04-18 Thread Michael McCandless
It also indexes edge ngrams for short sequences (e.g. a*, b*, etc.) and
switches to ordinary PrefixQuery for longer sequences, and does some work
at search time to do the "infixing".

But yeah otherwise that's it.

If your ranking at lookup isn't exactly matching the weight, but "roughly"
has some correlation to it, you could still use the fast early termination,
except collect deeper than just the top N to ensure you likely found the
best hits according to your ranking function.
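
Roughly how it is used at the Lucene level (a sketch, assuming a 6.x-era
API, a Directory named dir, and a StandardAnalyzer):

    AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(dir, new StandardAnalyzer());
    suggester.add(new BytesRef("London"), null, 1000L, new BytesRef("{\"country\":\"UK\"}"));
    suggester.add(new BytesRef("Londonderry"), null, 10L, new BytesRef("{\"country\":\"UK\"}"));
    suggester.refresh();   // make the added entries visible to lookup
    List<Lookup.LookupResult> results = suggester.lookup("lond", 5, true, true);
    // results come back ranked by the index-time weight (1000 before 10)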

Mike McCandless

http://blog.mikemccandless.com

On Tue, Apr 18, 2017 at 4:35 PM, OTH <omer.t@gmail.com> wrote:

> I see.  I had actually overlooked the fact that Suggester provides a
> 'weightField', and I could possibly use that in my case instead of the
> regular Solr index with bq.
>
> So if I understand then - the main advantage of using the
> AnalyzingInfixSuggester instead of a regular Solr index (since both are
> using standard Lucene?) is that the AInfixSuggester does sorting at
> index-time using the weightField?  So it's only ever advantageous to use
> this Suggester if you need sorting based on a field?
>
> Thanks
>
> On Tue, Apr 18, 2017 at 2:20 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > AnalyzingInfixSuggester uses index-time sort, to sort all postings by the
> > suggest weight, so that lookup, as long as you sort by the suggest
> > weight, is extremely fast.
> >
> > But if you need to rank at lookup time by something not "congruent" with
> > the index-time sort then you lose that benefit.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sun, Apr 16, 2017 at 11:46 AM, OTH <omer.t@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > From what I understand, the AnalyzingInfixSuggester is using a simple
> > > Lucene query; so I was wondering, how then would this suggester have
> > better
> > > performance than using a simple Solr 'select' query on a regular Solr
> > index
> > > (with an asterisk placed at the start and end of the query string).  I
> > > could understand why say an FST based suggester would be faster, but I
> > > wanted to confirm if that indeed is the case with
> > AnalyzingInfixSuggester.
> > >
> > > One reason I ask is:
> > > I needed the results to be boosted based on the value of another field;
> > > e.g., if a user in the UK is searching for cities, then I'd need the
> > cities
> > > which are in the UK to be boosted.  I was able to do this with a
> regular
> > > Solr index by adding something like these parameters:
> > > defType=edismax&bq=country:UK^2.0
> > >
> > > However, I'm not sure if this is possible with the Suggester.
> Moreover -
> > > other than the 'country' field above, there are other fields as well
> > which
> > > I need to be returned with the results.  Since the Suggester seems to
> > only
> > > allow one additional field, called 'payload', I'm able to do this by
> > > putting the values of all the other fields into a JSON and then placing
> > > that into the 'payload' field - however, I don't know if it would be
> > > possible then to incorporate the boosting mechanism I showed above.
> > >
> > > So I was thinking of just using a regular Solr index instead of the
> > > Suggester; I wanted to confirm, what if any is the performance
> > improvement
> > > in using the AnalyzingInfixSuggester over using a regular index?
> > >
> > > Much thanks
> > >
> >
>


Re: AnalyzingInfixSuggester performance

2017-04-18 Thread Michael McCandless
AnalyzingInfixSuggester uses index-time sort, to sort all postings by the
suggest weight, so that lookup, as long as you sort by the suggest weight,
is extremely fast.

But if you need to rank at lookup time by something not "congruent" with
the index-time sort then you lose that benefit.

Mike McCandless

http://blog.mikemccandless.com

On Sun, Apr 16, 2017 at 11:46 AM, OTH  wrote:

> Hello,
>
> From what I understand, the AnalyzingInfixSuggester is using a simple
> Lucene query; so I was wondering, how then would this suggester have better
> performance than using a simple Solr 'select' query on a regular Solr index
> (with an asterisk placed at the start and end of the query string).  I
> could understand why say an FST based suggester would be faster, but I
> wanted to confirm if that indeed is the case with AnalyzingInfixSuggester.
>
> One reason I ask is:
> I needed the results to be boosted based on the value of another field;
> e.g., if a user in the UK is searching for cities, then I'd need the cities
> which are in the UK to be boosted.  I was able to do this with a regular
> Solr index by adding something like these parameters:
> defType=edismax&bq=country:UK^2.0
>
> However, I'm not sure if this is possible with the Suggester.  Moreover -
> other than the 'country' field above, there are other fields as well which
> I need to be returned with the results.  Since the Suggester seems to only
> allow one additional field, called 'payload', I'm able to do this by
> putting the values of all the other fields into a JSON and then placing
> that into the 'payload' field - however, I don't know if it would be
> possible then to incorporate the boosting mechanism I showed above.
>
> So I was thinking of just using a regular Solr index instead of the
> Suggester; I wanted to confirm, what if any is the performance improvement
> in using the AnalyzingInfixSuggester over using a regular index?
>
> Much thanks
>


Re: Is there a way to tell if multivalued field actually contains multiple values?

2016-11-11 Thread Michael McCandless
I think you can use the term stats that Lucene tracks for each field.

Compare Terms.getSumTotalTermFreq and Terms.getDocCount.  If they are
equal it means every document that had this field, had only one token.
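
A quick sketch at the Lucene level (assumes the field yields one token per
value and a top-level IndexReader named reader; the helper is
MultiFields.getTerms in 6.x, MultiTerms.getTerms in newer versions):

    Terms terms = MultiFields.getTerms(reader, "myField");
    if (terms != null && terms.getSumTotalTermFreq() == terms.getDocCount()) {
      // every document that has "myField" has exactly one token: effectively single-valued
    }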

Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 11, 2016 at 5:50 AM, Mikhail Khludnev  wrote:
> I suppose it's needless to remind that norm(field) is proportional (but not
> precisely by default) to number of tokens in a doc's field (although not
> actual text values).
>
> On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch 
> wrote:
>
>> Hello,
>>
>> Say I indexed a large dataset against a schemaless configuration. Now
>> I have a bunch of multivalued fields. Is there any way to say which of
>> these (text) fields have (for given data) only single values? I know I
>> am supposed to look at the original data, and all that, but this is
>> more for debugging/troubleshooting.
>>
>> Turning termOffsets/termPositions would make it easy, but that's a bit
>> messy for troubleshooting purposes.
>>
>> I was thinking that one giveaway is the positionIncrementGap causing
>> the second value's token to start at number above a hundred. But I am
>> not sure how to craft a query against a field to see if such a token
>> is generically present.
>>
>>
>> Any ideas?
>>
>> Regards,
>> Alex.
>>
>> 
>> Solr Example reading group is starting November 2016, join us at
>> http://j.mp/SolrERG
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev


[ANNOUNCE] Apache Solr 6.2.0 released

2016-08-26 Thread Michael McCandless
26 August 2016, Apache Solr 6.2.0 available

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

Solr 6.2.0 is available for immediate download at:

 * http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 6.2.0 Release Highlights:

DocValues, streaming, /export, machine learning
* DocValues can now be used with BoolFields
* Date and boolean support added to /export handler
* Add "scoreNodes" streaming graph expression
* Support parallel ETL with the "topic" expression
* Feature selection and logistic regression on text via new streaming
expressions: "features" and "train"

bin/solr script
* Add basic auth support to the bin/solr script
* File operations to/from Zookeeper are now supported

SolrCloud
* New tag 'role' in replica placement rules, e.g. rule=role:!overseer
keeps new replicas off overseer nodes
* CDCR: fall back to whole-index replication when tlogs are insufficient
* New REPLACENODE command to decommission an existing node and replace
it with another new node
* New DELETENODE command to delete all replicas on a node

Security
* Add Kerberos delegation token support
* Support secure impersonation / proxy user for Kerberos authentication

Misc changes
* A large number of regressions were fixed in the new Admin UI
* New boolean comparison function queries comparing numeric arguments:
gt, gte, lt, lte, eq
* Upgraded Extraction module to Apache Tika 1.13.
* Updated to Hadoop 2.7.2

Further details of changes are available in the change log available at:
http://lucene.apache.org/solr/6_2_0/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also applies to Maven access.

Happy searching,

Mike McCandless

http://blog.mikemccandless.com


Re: ConcurrentMergeScheduler options not exposed

2016-06-17 Thread Michael McCandless
Really we need the infoStream output, to see what IW is doing, to take so
long merging.

Likely only one merge thread is running (CMS tries to detect if your IO
system "spins" and if so, uses 1 merge thread) ... maybe try configuring
this to something higher since your RAID array can probably handle it?
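
The Lucene-level knob looks like this (a sketch, assuming you can reach the
IndexWriterConfig named iwc; in Solr the equivalent should be the
mergeScheduler settings under indexConfig in solrconfig.xml):

    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(6, 3);   // maxMergeCount=6 queued merges, maxThreadCount=3 merge threads
    iwc.setMergeScheduler(cms);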

It's good that disabling auto IO throttling didn't change things ... that's
what I expected (since forced merges are not throttled by default).

Maybe capture all thread stacks and post back here?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 16, 2016 at 4:04 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 6/16/2016 2:35 AM, Michael McCandless wrote:
> >
> > Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for
> > very long ... unless there is a huge percentage of deletes. Also, by
> > default CMS doesn't throttle forced merges (see
> > CMS.get/setForceMergeMBPerSec). Maybe capture
> > IndexWriter.setInfoStream output?
>
> I can see the problem myself.  I have a RAID10 array with six SATA
> disks.  When I click the Optimize button for a core that's several
> gigabytes, iotop shows me reads happening at about 100MB/s for several
> seconds, then writes clocking no more than 25 MB/s, and usually a lot
> less.  The last several gigabytes that were written were happening at
> less than 5 MB/s.  This is VERY slow, and does affect my nightly
> indexing processes.
>
> Asking the shell to copy a 5GB file revealed sustained write rates of
> over 500MB/s, so the hardware can definitely go faster.
>
> I patched in an option for solrconfig.xml where I could force it to call
> disableAutoIOThrottle().  I included logging in my patch to make
> absolutely sure that the new code was used.  This option made no
> difference in the write speed.  I also enabled infoStream, but either I
> configured it wrong or I do not know where to look for the messages.  I
> was modifying and compiling branch_5_5.
>
> This is the patch that I applied:
>
> http://apaste.info/wKG
>
> I did see the expected log entries in solr.log when I restarted with the
> patch and the new option in solrconfig.xml.
>
> What else can I look at?
>
> Thanks,
> Shawn
>
>


Re: ConcurrentMergeScheduler options not exposed

2016-06-16 Thread Michael McCandless
Hmm, merging can't read at 800 MB/sec and only write at 20 MB/sec for very
long ... unless there is a huge percentage of deletes.

Also, by default CMS doesn't throttle forced merges (see
CMS.get/setForceMergeMBPerSec).

Maybe capture IndexWriter.setInfoStream output?
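
A sketch of turning that on at the Lucene level (dir and analyzer assumed; in
Solr the infoStream flag under indexConfig in solrconfig.xml should do the
same thing):

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setInfoStream(new PrintStreamInfoStream(System.out));  // logs flush/merge activity
    IndexWriter writer = new IndexWriter(dir, iwc);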

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jun 15, 2016 at 9:12 PM, Shawn Heisey  wrote:

> On the IRC channel, I ran into somebody who was having problems with
> optimizes on their Solr indexes taking a really long time.  When
> investigating, they found that during the optimize, *reads* were
> happening on their SSD disk at over 800MB/s, but *writes* were
> proceeding at only 20 MB/s.
>
> Looking into ConcurrentMergeScheduler, I discovered that it does indeed
> have a default write throttle of only 20 MB/s.  I saw code that would
> sometimes set the speed to unlimited, but had a hard time figuring out
> what circumstances will result in the different settings, so based on
> the user experience, I assume that the 20MB/s throttle must be applied
> for Solr optimizes.
>
> From what I can see in the code, there's currently no way in
> solrconfig.xml to configure scheduler options like the maximum write
> speed.  Before I an open an issue to add additional configuration
> options for the merge scheduler, I thought it might be a good idea to
> just double-check with everyone here to see whether there's something I
> missed.
>
> This is likely even affecting people who are not using SSD storage.
> Most modern magnetic disks can easily exceed 20MB/s on both reads and
> writes.  Some RAID arrays can write REALLY fast.
>
> Thanks,
> Shawn
>
>


Re: Lucene/Solr Git Mirrors 5 day lag behind SVN?

2015-10-24 Thread Michael McCandless
I added a comment on the INFRA issue.

I don't understand why it periodically "gets stuck".

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 23, 2015 at 11:27 AM, Kevin Risden
 wrote:
> It looks like both Apache Git mirror (git://git.apache.org/lucene-solr.git)
> and GitHub mirror (https://github.com/apache/lucene-solr.git) are 5 days
> behind SVN. This seems to have happened before:
> https://issues.apache.org/jira/browse/INFRA-9182
>
> Is this a known issue?
>
> Kevin Risden


Re: CheckIndex failed for Solr 4.7.2 index

2015-06-09 Thread Michael McCandless
IBM's J9 JVM unfortunately still has a number of nasty bugs affecting
Lucene; most likely you are hitting one of these.  We used to test J9
in our continuous Jenkins jobs, but there were just too many
J9-specific failures and we couldn't get IBM's attention to resolve
them, so we stopped.  For now you should switch to Oracle JDK, or
OpenJDK.

But there's some good news!  Recently, a member from the IBM JDK team
replied to this Elasticsearch thread:
https://discuss.elastic.co/t/need-help-with-ibm-jdk-issues-with-es-1-4-5/1748/3

And then Robert Muir ran Lucene's tests with the latest J9 and opened
several issues; see the 2nd bullet under Apache Lucene at
https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2015-06-09
and at least one of the issues seems to be making progress
(https://issues.apache.org/jira/browse/LUCENE-6522).

So there is hope for the future, but for today it's too dangerous to
use J9 with Lucene/Solr/Elasticsearch.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 9, 2015 at 12:23 PM, Guy Moshkowich g...@il.ibm.com wrote:
 We are using Solr 4.7.2 and we found that when we run
 CheckIndex.checkIndex on one of the Solr shards we are getting the error
 below.
 Both replicas of the shard had the same error.
 The shard index looked healthy:
 1) It appeared active in the Solr admin page.
 2) We could run searches against it.
 3) No relevant errors were found in Solr logs.
 4) After we optimized the index in LUKE, CheckIndex did not report any
 error.

 My questions:
 1) Is this a real issue, or a known bug in the CheckIndex code that causes
 a false negative?
 2) Is there a known fix for this issue?

 Here is the error we got:
  validateIndex Segments file=segments_bhe numSegments=15 version=4.7
 format= userData={commitTimeMSec=1432689607801}
   1 of 15: name=_6cth docCount=248744
 codec=Lucene46
 compound=false
 numFiles=11
 size (MB)=86.542
 diagnostics = {timestamp=1428883354605, os=Linux,
 os.version=2.6.32-431.23.3.el6.x86_64, mergeFactor=10, source=merge,
 lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64,
 mergeMaxNumSegments=-1, java.version=1.7.0, java.vendor=IBM Corporation}
 has deletions [delGen=3174]
 test: open reader.FAILED
 WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: liveDocs count mismatch: info=156867, vs
 bits=156872
 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:581)
 at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372)

 Appreciate your help,
 Guy.


[ANNOUNCE] Apache Solr 4.10.4 released

2015-03-05 Thread Michael McCandless
March 2015, Apache Solr™ 4.10.4 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.4 is available for immediate download at:

http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.4

Solr 4.10.4 includes 24 bug fixes, as well as Lucene 4.10.4 and its 13
bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


Re: Frequent deletions

2015-01-01 Thread Michael McCandless
Also see this G+ post I wrote up recently showing how the percentage of
deleted docs changes over time for an "every add also deletes a previous
document" stress test: https://plus.google.com/112759599082866346694/posts/MJVueTznYnD

Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 31, 2014 at 12:21 PM, Erick Erickson
erickerick...@gmail.com wrote:
 It's usually not necessary to optimize, as more indexing happens you
 should see background merges happen that'll reclaim the space, so I
 wouldn't worry about it unless you're seeing actual problems that have
 to be addressed. Here's a great visualization of the process:

 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 See especially the third video, TieredMergePolicy which is the default.

 If you insist, however, try a commit with expungeDeletes=true

 and if that isn't enough, try an optimize call:
 you can issue a force merge (aka optimize) command from the URL (or
 cURL etc.) as:
 http://localhost:8983/solr/techproducts/update?optimize=true

 But please don't do this unless it's absolutely necessary. You state
 that you have frequent deletions, but eventually this should all
 happen in the background. Optimize is a fairly expensive operation and
 should be used judiciously.

 Best,
 Erick

 On Wed, Dec 31, 2014 at 1:32 AM, ig01 inna.gel...@elbitsystems.com wrote:
 Hello,
 We perform frequent deletions from our index, which greatly increases the
 index size.
 How can we perform an optimization in order to reduce the size.
 Please advise,
 Thanks.






Re: DocsEnum and TermsEnum reuse in lucene join library?

2014-12-06 Thread Michael McCandless
They should be reused if the impl. allows for it.

Besides reducing GC cost, it can also be a sizable performance gain
since these enums can have quite a bit of state that otherwise must be
re-initialized.

If you really don't want to reuse them (force a new enum every time), pass null.
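
In the 4.x API that looks roughly like this (a sketch; termsEnum, acceptDocs
and docsEnum are as in the quoted code below):

    // reuse: pass the previous enum back in and it may be recycled
    docsEnum = termsEnum.docs(acceptDocs, docsEnum, DocsEnum.FLAG_NONE);
    // force a fresh enum every time instead:
    DocsEnum fresh = termsEnum.docs(acceptDocs, null, DocsEnum.FLAG_NONE);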

Mike McCandless

http://blog.mikemccandless.com


On Fri, Dec 5, 2014 at 8:14 PM, Darin Amos dari...@gmail.com wrote:
 Hi All,

 I have been working on a custom query and I am going off of samples in the 
 lucene join library (4.3.0) and I am a little unclear about a couple lines.

 1) When getting a TermsEnum in 
 TermsIncludingScoreQuery.createWeight(…).scorer()… A previous TermsEnum is 
 used like the following:

 segmentTermsEnum = terms.iterator(segmentTermsEnum);

 2) When getting a DocsEnum SVInOrderScorer.fillDocsAndScores:

  for (int i = 0; i < terms.size(); i++) {
 if (termsEnum.seekExact(terms.get(ords[i], spare), true)) {
   docsEnum = termsEnum.docs(acceptDocs, docsEnum, DocsEnum.FLAG_NONE);

 My assumption is that the previous enum values are not reused, but that this is a
 tuning mechanism for garbage collection; is that the correct assumption?

 Thanks!

 Darin


[ANNOUNCE] Apache Solr 4.10.2 released

2014-10-31 Thread Michael McCandless
October 2014, Apache Solr™ 4.10.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Halloween,

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.10.1 released

2014-09-29 Thread Michael McCandless
September 2014, Apache Solr™ 4.10.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.1 includes 6 bug fixes, as well as Lucene 4.10.1 and its 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.9.1 released

2014-09-22 Thread Michael McCandless
September 2014, Apache Solr™ 4.9.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.9.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.9.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.9.1 includes 2 bug fixes, as well as Lucene 4.9.1 and its 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


Re: [ANNOUNCE] Apache Solr 4.9.1 released

2014-09-22 Thread Michael McCandless
I'll merge back the 4.9.1 CHANGES entries so when we do a 4.10.1,
they'll be there ... and I'll also make sure any fix we backported for
4.9.1, we also backport for 4.10.1.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Sep 22, 2014 at 9:11 AM, Shawn Heisey s...@elyograg.org wrote:
 On 9/22/2014 6:24 AM, Bernd Fehling wrote:
 This confuses me a bit, aren't we already at 4.10.0?

 But CHANGES.txt of 4.10.0 doesn't know anything about 4.9.1.

 Is this an interim version or something about backward compatibility?

 It's a bugfix release, fixing some showstopper bugs in a recent release
 that is critical to the RM (Michael McCandless) and/or an organization
 where he has influence or liability.  Apparently this was a more
 expedient path than completely validating a 4.10 upgrade and waiting for
 the 4.10.1 bugfix release.  Validating the 4.10 upgrade probably would
 have taken considerably longer than simply backporting some critical
 fixes to the 4.9 release that they're actually using.

 The two bug fixes for Solr are a license issue and a security
 vulnerability.  The bugfix list for Lucene includes fixes for some major
 problems that can cause index corruption or incorrect operation.

 I had thought that the CHANGES.txt list would remain the same for trunk
 and the stable branch because some of those bugfixes skipped the 4.10.0
 release, but it looks like that's not the case for LUCENE-5919 (the only
 one that I actually investigated).  If these issues all got updated to
 the 4.9.1 section of CHANGES.txt in places other than the 4.9 branch and
 the 4.9.1 tag, there might be a small amount of confusion in the distant
 future.  That confusion would be cleared up by looking at CHANGES.txt
 for the 4.10.0 release, though.

 Looks like the 4.10.1 release has been delayed a little.  I hope that
 this collection of fixes makes it in there too, so that 4.10.0 is the
 only release where that confusion might impact users.

 Thanks,
 Shawn



Re: optimize and .nfsXXXX files

2014-08-18 Thread Michael McCandless
Soft commit (i.e. opening a new IndexReader in Lucene and closing the
old one) should make those go away?

The .nfsX files are created when a file is deleted but a local
process (in this case, the current Lucene IndexReader) still has the
file open.
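
The reopen that releases those deleted files looks roughly like this at the
Lucene level (a sketch, with oldReader being the currently open reader; in
Solr, a commit that reopens the searcher has the same effect):

    DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader);
    if (newReader != null) {
      oldReader.close();      // closing the old reader releases the deleted files, so .nfsXXXX goes away
      oldReader = newReader;
    }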

Mike McCandless

http://blog.mikemccandless.com


On Mon, Aug 18, 2014 at 5:20 AM, BorisG boris.golo...@mail.huji.ac.il wrote:
 Hi,
 I am using solr 3.6.2.
 I use NFS and my index folder is a mounted folder.
 When I run the command:
 server:port/solr/collection1/update?optimize=true&maxSegments=1&waitFlush=true&expungeDeletes=true
 in order to optimize my index, I have some .nfsX files created while the
 optimize is running.
 The problem that i am having is that after optimize finishes its run the
 .nfs files aren't deleted.
 When I close the solr process they immediately disappear.
 I don't want to restart the solr process after each optimize, is there
 anything that can be done in order for solr to get rid of those files.

 Thanks,






Re: Does lucene uses tries?

2014-06-05 Thread Michael McCandless
The default terms dictionary (BlockTree) also uses a trie index
structure to locate the block on disk that may contain a target term.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jun 5, 2014 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote:
 I just want to know whether Lucene uses the trie data structure to store
 the data.

 Lucene (and Solr) will use whatever you tell it when you create the field.

 If you indicate in your schema fieldType that you want to use a class of
 solr.TrieIntField, then the field will use a Lucene trie type that holds
 integers. Similar for TrieLongField, TrieFloatField, etc.

 Thanks,
 Shawn





Re: Expected date of release for Solr 4.7.1

2014-03-29 Thread Michael McCandless
RC2 is being voted on now ... so it should be soon (a few days, but
more if any new blocker issues are found and we need to do RC3).

Mike McCandless

http://blog.mikemccandless.com


On Sat, Mar 29, 2014 at 2:26 PM, Puneet Pawaia puneet.paw...@gmail.com wrote:
 Hi
 Any idea on the expected date of release for Solr 4.7.1
 Regards
 Puneet


Re: Enabling other SimpleText formats besides postings

2014-03-28 Thread Michael McCandless
You told the fieldType to use SimpleText only for the postings, not
all other parts of the codec (doc values, live docs, stored fields,
etc...), and so it used the default codec for those components.

If instead you used the SimpleTextCodec (not sure how to specify this
in Solr's schema.xml) then all components would be SimpleText.
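
At the Lucene level that would be roughly (a sketch; SimpleTextCodec lives in
the lucene-codecs module, and analyzer is assumed):

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setCodec(new SimpleTextCodec());   // postings, stored fields, doc values, etc. all as plain text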

Mike McCandless

http://blog.mikemccandless.com


On Fri, Mar 28, 2014 at 8:53 AM, Ken Krugler
kkrugler_li...@transpac.com wrote:
 Hi all,

 I've been using the SimpleTextCodec in the past, but I just noticed something 
 odd...

 I'm running Solr 4.3, and enable the SimpleText posting format via something 
 like:

 <fieldType name="date" class="solr.DateField" postingsFormat="SimpleText" />

 The resulting index does have the expected _0_SimpleText_0.pst text output, 
 but I just noticed that the other files are all the standard binary format 
 (e.g. .fdt for field data)

 Based on SimpleTextCodec.java, I was assuming that I'd get the 
 SimpleTextStoredFieldsFormat for stored data.

 This same holds true for most (all?) of the other files, e.g. 
 https://issues.apache.org/jira/browse/LUCENE-3074 is about adding a simple 
 text format for DocValues.

 I can walk the code to figure out what's up, but I'm hoping I just need to 
 change some configuration setting.

 Thanks!

 -- Ken

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr



Re: AutoSuggest like Google in Solr using Solarium Client.

2014-03-17 Thread Michael McCandless
I think it's best to use one of the many autosuggesters Lucene/Solr provide?

E.g. AnalyzingInfixSuggester is running here:
http://jirasearch.mikemccandless.com

But that's just one suggester... there are many more.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Mar 17, 2014 at 10:44 AM, Sohan Kalsariya
sohankalsar...@gmail.com wrote:
 Can anyone suggest the best practices for doing SpellCheck and
 AutoSuggest in Solarium?
 Can anyone give me an example of that?


 --
 Regards,
 *Sohan Kalsariya*


Re: Join Scoring

2014-02-13 Thread Michael McCandless
I suspect (not certain) one reason for the performance difference with
Solr vs Lucene joins is that Solr operates on a top-level reader?

This results in fast joins, but it means whenever you open a new
reader (NRT reader) there is a high cost to regenerate the top-level
data structures.

But if the app doesn't open NRT readers, or opens them rarely, perhaps
that cost is a good tradeoff to get faster joins.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 12:10 AM, anand chandak
anand.chan...@oracle.com wrote:
 Re-posting...



 Thanks,

 Anand



 On 2/12/2014 10:55 AM, anand chandak wrote:

 Thanks David, really helpful response.

  You mentioned that if we have to add scoring support in Solr then a
  possible approach would be to add a custom QueryParser, which might be
  using Lucene's JOIN module.  I have tried this approach and it makes things
  slow, because I believe it is doing more searches.

 Curious, if it is possible instead to enhance existing solr's
  JoinQParserPlugin and add the scoring support in the same class? Do you
  think it's feasible and recommended? If yes, what would it take (high level)
 - in terms of code changes, any pointers ?


 Thanks,

 Anand


 On 2/12/2014 10:31 AM, David Smiley (@MITRE.org) wrote:

 Hi Anand.

 Solr's JOIN query, {!join}, constant-scores.  It's simpler and faster and
  more memory efficient (particularly the worst-case memory use) to
 implement
 the JOIN query without scoring, so that's why.  Of course, you might want
 it
 to score and pay whatever penalty is involved.  For that you'll need to
 write a Solr QueryParser that might use Lucene's join module which
 has
 scoring variants.  I've taken this approach before.  You asked a specific
 question about the purpose of JoinScorer when it doesn't actually score.
 Lucene's Query produces a Weight which in turn produces a Scorer
 that
 is a DocIdSetIterator plus it returns a score.  So Queries have to have a
 Scorer to match any document even if the score is always 1.

 Solr does indeed have a lot of caching; that may be in play here when
 comparing against a quick attempt at using Lucene directly.  In
 particular,
 the matching documents are likely to end up in Solr's DocumentCache.
 Returning stored fields that come back in search results are one of the
 more
 expensive things Lucene/Solr does.

 I also think you noted that the fields on documents from the from side
 of
 the query are not available to be returned in search results, just the
 to
 side.  Yup; that's true.  To remedy this, you might write a Solr
 SearchComponent that adds fields from the from side.  That could be
 tricky
 to do; it would probably need to re-run the from-side query but filtered
 to
 the matching top-N documents being returned.

 ~ David


 anand chandak wrote

 Resending, if somebody can please respond.


 Thanks,

 Anand


 On 2/5/2014 6:26 PM, anand chandak wrote:
 Hi,

  Having a question on join scores: why doesn't the Solr join query return
  the scores? Looking at the code, I see there's a JoinScorer defined in
  the JoinQParserPlugin class. If it's not used for scoring, where is it
  actually used?

  Also, to evaluate the performance of the Solr join plugin vs Lucene's
  JoinUtil, I fired the same join query against the same data-set and same
  schema, and in the results I am always seeing the QTime for Solr much lower
  than Lucene's. What is the reason behind this?  Solr doesn't return
  scores; could that cause so much difference?

  My guess is Solr has a very sophisticated caching mechanism and that might
  be coming into play; is that true? Or is there a difference in the way the
  JOIN happens in the two approaches?

  If I understand correctly, both implementations are using a 2-pass
  approach - first collect all the terms from the fromField, and then return
  all documents that have matching terms in the toField.

 If somebody can throw some light, would highly appreciate.

 Thanks,
 Anand




 -
   Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book






Re: Lucene Join

2014-01-30 Thread Michael McCandless
Look in lucene's join module?
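
i.e. JoinUtil, which looks roughly like this (a sketch; the field names and
the fromSearcher/toSearcher variables here are made up):

    Query joinQuery = JoinUtil.createJoinQuery(
        "productId",                                // fromField
        false,                                      // multipleValuesPerDocument
        "productId",                                // toField
        new TermQuery(new Term("color", "red")),    // fromQuery
        fromSearcher,                               // IndexSearcher over the "from" side
        ScoreMode.None);                            // or Avg/Max/Total to carry scores across the join
    TopDocs hits = toSearcher.search(joinQuery, 10);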

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jan 30, 2014 at 4:15 AM, anand chandak anand.chan...@oracle.com wrote:
 Hi,


 I am trying to find out whether the Lucene joins (not Solr joins) are
 using any filter cache. The API that Lucene uses for joining is
 JoinUtil.createJoinQuery(); where can I find the source code for this API?


 Thanks in advance

 Thanks,

 Anand



Re: background merge hit exception while optimizing index (SOLR 4.4.0)

2014-01-13 Thread Michael McCandless
Which version of Java are you using?

That root cause exception is somewhat spooky: it's in the
ByteBufferIndexInput code that handles an UnderflowException, ie when a
small (maybe a few hundred bytes) read happens to span the 1 GB page
boundary, and specifically the exception happens on the final read
(curBuf.get(b, offset, len)).  Such page-spanning reads are very rare.

The code looks fine to me though, and it's hard to explain how NPE (b
= null) could happen: that byte array is allocated in the
Lucene41PostingsReader.BlockDocsAndPositionsEnum class's ctor: encoded
= new byte[MAX_ENCODED_SIZE].

Separately, you really should not have to optimize daily, if ever.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jan 13, 2014 at 2:06 AM, Ralf Matulat ralf.matu...@bundestag.de wrote:
 Hi,
 I am currently running into merge-issues while optimizing an index.

 To give you some information:

 We are running 4 SOLR Servers with identical OS, VM-Hardware, RAM etc.
 Only one Server by now is having issues, the others are fine.

 We are running SOLR 4.4.0 with Tomcat 6.0
 It had been running since October without any problems.
 The problems first occurred after doing a minor change in the synonyms.txt, but
 I guess that was just a coincidence.

 We added `ulimit -v unlimited` to our tomcat init-script years ago.

 We have 4 Cores running on each SOLR Server, configuration, index-sizes of
 all 4 servers are identical (we are distributing cfgs via git).

 We did a rebuild of the index twice: First time without removing the old
 index files, second time deleting the data dir and starting from scratch.

 We are working with DIH, getting data from a MySQL DB.
 After an initial complete index-run, the optimize is working. The optimize
 fails one or two days later.

 We are doing one optimize run a day; the index contains ~10 million
 documents, and the index size on disk is ~39GB while having 127G of free
 disk space.

 We have a mergeFactor of 3.

 The solr.log says:

 ERROR - 2014-01-12 22:47:11.062; org.apache.solr.common.SolrException;
 java.io.IOException: background merge hit exception: _dc8(4.4):C9876366/1327
 _e8u(4.4):C4250/7 _f4a(4.4):C1553/13 _fj6(4.4
 ):C1903/15 _ep3(4.4):C1217/42 _fle(4.4):C256/7 _flf(4.4):C11 into _flg
 [maxNumSegments=1]
 at
 org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1714)
 at
 org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1650)
 at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:530)
 at
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
 at
 org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
 at
 org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1235)
 at
 org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1219)
 at
 org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
 at
 org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
 at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
 at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
 at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
 at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
 at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
 at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
 at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
 at java.lang.Thread.run(Thread.java:735)
 Caused by: java.lang.NullPointerException
 at java.nio.ByteBuffer.get(ByteBuffer.java:661)
  

Re: background merge hit exception while optimizing index (SOLR 4.4.0)

2014-01-13 Thread Michael McCandless
I have trouble understanding J9's version strings ... but, is it
really from 2008?  You could be hitting a JVM bug; can you test
upgrading?

I don't have much experience with Solr faceting on optimized vs
unoptimized indices; maybe someone else can answer your question.

Lucene's facet module (not yet exposed through Solr) performance
shouldn't change much for optimized vs unoptimized indices.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jan 13, 2014 at 10:09 AM, Ralf Matulat
ralf.matu...@bundestag.de wrote:
 java -version
 java version 1.6.0
 Java(TM) SE Runtime Environment (build
 pxa6460sr3ifix-20090218_02(SR3+IZ43791+IZ43798))
 IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 Linux amd64-64
 jvmxa6460-20081105_25433 (JIT enabled, AOT enabled)
 J9VM - 20081105_025433_LHdSMr
 JIT  - r9_20081031_1330
 GC   - 20081027_AB)
 JCL  - 20090218_01

 A question regarding to optimizing the index:
 As of SOLR 3.X we encountered massive performance improvements with faceted
 queries after optimizing an index. So we once started optimizing the indexes
 on a daily basis.
 With SOLR 4.X and the new index-format that is not true anymore?

 Btw: The checkIndex failed with 'java.io.FileNotFoundException:', I guess
 because I did not stop the tomcat while checking. So SOLR created, merged
 and deleted some segments while checking. I will restart the check after
 stopping SOLR.

 Kind regards
 Ralf Matulat



 Which version of Java are you using?

 That root cause exception is somewhat spooky: it's in the
 ByteBufferIndexCode that handles an UnderflowException, ie when a
 small (maybe a few hundred bytes) read happens to span the 1 GB page
 boundary, and specifically the exception happens on the final read
 (curBuf.get(b, offset, len)).  Such page-spanning reads are very rare.

 The code looks fine to me though, and it's hard to explain how NPE (b
 = null) could happen: that byte array is allocated in the
 Lucene41PostingsReader.BlockDocsAndPositionsEnum class's ctor: encoded
 = new byte[MAX_ENCODED_SIZE].

 Separately, you really should not have to optimize daily, if ever.

 Mike McCandless

 http://blog.mikemccandless.com






Re: MergePolicy for append-only indices?

2014-01-08 Thread Michael McCandless
On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 I think the key optimization when there are no deletions is that you don't
 need to renumber documents and can bulk-copy blocks of contiguous documents,
 and that is independent of merge policy. I think :)

Merging of term vectors and stored fields will always use bulk-copy
for contiguous chunks of non-deleted docs, so for the append-only case
these will be the max chunk size and be efficient.

We have no codec that implements bulk merging for postings, which
would be interesting to pursue: in the append-only case it's possible,
and merging of postings is normally by far the most time consuming
step of a merge.

Also, no RAM will be used holding the doc mapping, since the docIDs
don't change.

These benefits are independent of the MergePolicy.

I think TieredMergePolicy will work fine for append-only; I'm not sure
how you'd improve on its approach.  It will in general renumber the
docs, so if that's a problem, apps should use LogByteSizeMP.
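
(For concreteness, a minimal sketch -- not from the original message -- of
wiring LogByteSizeMergePolicy into a Lucene 4.x IndexWriter; the index path
and mergeFactor are illustrative only:)

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class AppendOnlyWriter {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File("/path/to/index"));
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_44,
          new StandardAnalyzer(Version.LUCENE_44));

      // LogByteSizeMergePolicy merges only adjacent segments, so docIDs
      // keep their insertion order (unlike the default TieredMergePolicy).
      LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
      mp.setMergeFactor(10);
      iwc.setMergePolicy(mp);

      IndexWriter writer = new IndexWriter(dir, iwc);
      // ... addDocument() calls ...
      writer.close();
    }
  }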

Mike McCandless

http://blog.mikemccandless.com


Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-31 Thread Michael McCandless
On Mon, Dec 30, 2013 at 1:22 PM, Greg Preston
gpres...@marinsoftware.com wrote:
 That was it.  Setting omitNorms=true on all fields fixed my problem.
  I left it indexing all weekend, and heap usage still looks great.

Good!

 I'm still not clear why bouncing the solr instance freed up memory,
 unless the in-memory structure for this norms data is lazily loaded
 somehow.

In fact it is lazily loaded, the first time a search (well,
Similarity) needs to load the norms for scoring.
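
(As a hedged illustration, not the poster's actual schema: the Lucene-level
equivalent of Solr's omitNorms="true" looks like this -- with norms off, no
per-document norm byte is loaded for the field at search time. The field
name and value are made up.)

  FieldType ft = new FieldType(TextField.TYPE_STORED);
  ft.setOmitNorms(true);   // no norms indexed => nothing to lazily load later
  ft.freeze();
  Document doc = new Document();
  doc.add(new Field("description", "some text", ft));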

 Anyway, thank you very much for the suggestion.

You're welcome.

Mike McCandless

http://blog.mikemccandless.com


Re: Possible memory leak after segment merge? (related to DocValues?)

2013-12-27 Thread Michael McCandless
Likely this is for field norms, which use doc values under the hood.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
gpres...@marinsoftware.com wrote:
 Does anybody with knowledge of solr internals know why I'm seeing
 instances of Lucene42DocValuesProducer when I don't have any fields
 that are using DocValues?  Or am I misunderstanding what this class is
 for?

 -Greg


 On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
 gpres...@marinsoftware.com wrote:
 Hello,

 I'm loading up our solr cloud with data (from a solrj client) and
 running into a weird memory issue.  I can reliably reproduce the
 problem.

 - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
 - 24 solr nodes (one shard each), spread across 3 physical hosts, each
 host has 256G of memory
 - index and tlogs on ssd
 - Xmx=7G, G1GC
 - Java 1.7.0_25
 - schema and solrconfig.xml attached

 I'm using composite routing to route documents with the same clientId
 to the same shard.  After several hours of indexing, I occasionally
 see an IndexWriter go OOM.  I think that's a symptom.  When that
 happens, indexing continues, and that node's tlog starts to grow.
 When I notice this, I stop indexing, and bounce the problem node.
 That's where it gets interesting.

 Upon bouncing, the tlog replays, and then segments merge.  Once the
 merging is complete, the heap is fairly full, and forced full GC only
 helps a little.  But if I then bounce the node again, the heap usage
 goes way down, and stays low until the next segment merge.  I believe
 segment merges are also what causes the original OOM.

 More details:

 Index on disk for this node is ~13G, tlog is ~2.5G.
 See attached mem1.png.  This is a jconsole view of the heap during the
 following:

 (Solr cloud node started at the left edge of this graph)

 A) One CPU core pegged at 100%.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
 at 
 org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
 at 
 org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
 at 
 org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
 at 
 org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
 at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
 memory freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at 
 org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
 at 
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
 at 
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

 C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
 freed.  Thread dump shows:
 Lucene Merge Thread #0 daemon prio=10 tid=0x7f5a3c064800
 nid=0x7a74 runnable [0x7f5a41c5f000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
 at 
 org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
 at 
 org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
 at 
 org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
 at 
 

Re: Problems with gaps removed with SynonymFilter

2013-09-23 Thread Michael McCandless
Unfortunately the current SynonymFilter cannot handle posInc != 1 ...
we could perhaps try to fix this ... patches welcome :)

So for now it's best to place SynonymFilter before StopFilter, and
before any other filters that may create graph tokens (posLen  1,
posInc == 0).

Mike McCandless

http://blog.mikemccandless.com


On Mon, Sep 23, 2013 at 2:45 AM,  david.dav...@correo.aeat.es wrote:
 Hi,

 I am having a problem applying StopFilterFactory and
 SynonimFilterFactory. The problem is that SynonymFilter removes the gaps
 that were previously put by the StopFilterFactory. I'm applying filters in

 query time, because users need to change synonym lists frequently.

 This is my schema, and an example of the issue:


 String: documentacion para agentes

 org.apache.solr.analysis.WhitespaceTokenizerFactory
 {luceneMatchVersion=LUCENE_35}
 position    1               2       3
 term text   documentación   para    agentes
 startOffset 0               14      19
 endOffset   13              18      26
 org.apache.solr.analysis.LowerCaseFilterFactory
 {luceneMatchVersion=LUCENE_35}
 position    1               2       3
 term text   documentación   para    agentes
 startOffset 0               14      19
 endOffset   13              18      26
 org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt,
 ignoreCase=true, enablePositionIncrements=true,
 luceneMatchVersion=LUCENE_35}
 position    1               3
 term text   documentación   agentes
 startOffset 0               19
 endOffset   13              26
 org.apache.solr.analysis.SynonymFilterFactory
 {synonyms=sinonimos_intranet.txt, expand=true, ignoreCase=true,
 luceneMatchVersion=LUCENE_35}
 position    1               2
 term text   documentación   agente
             archivo         agentes
 type        SYNONYM         SYNONYM
             SYNONYM         SYNONYM
 startOffset 0               19
             0               19
 endOffset   13              26
             13              26


 As you can see, the position should be 1 and 3, but SynonymFilter removes
 the gap and moves token from position 3 to 2
 I've got the same problem with Solr 3.5 y 4.0.
 I don't know if it's a bug or an error with my configuration. In other
 schemas that I have worked with, I had always put the SynonymFilter
 previous to StopFilter, but in this I prefered using this order because of

 the big number of synonym that the list has (i.e. I don't want to generate

 a lot of synonyms for a word that I really wanted to remove).

 Thanks,

 David Dávila Atienza
 AEAT - Departamento de Informática Tributaria

 David Dávila Atienza
 AEAT - Departamento de Informática Tributaria
 Subdirección de Tecnologías de Análisis de la Información e Investigación
 del Fraude
 Área de Infraestructuras
 Teléfono: 915831543
 Extensión: 31543


Re: Why solr 4.0 use FSIndexOutput to write file, otherwise MMap/NIO

2013-06-28 Thread Michael McCandless
Output is quite a bit simpler than input because all we do is write a
single stream of bytes with no seeking (append only), and it's done
with only one thread, so I don't think there'd be much to gain by
using the newer IO APIs for writing...

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jun 28, 2013 at 2:23 AM, Jeffery Wang
jeffery.w...@morningstar.com wrote:

 I have checked the FSDirectory, it will create MMapDirectory or 
 NIOFSDirectory for Directory.
 This two directory only supply IndexInput extend for read file 
 (MMapIndexInput extends ByteBufferIndexInput),
 why not there is not MMap/NIO IndexOutput extend for file write. It only use 
 FSIndexOutput for file write(FSIndexOutput extends BufferedIndexOutput).

 Does FSIndexOutput wirte file very slow than MMap/NIO? How to improve the IO 
 write performance.

 Thanks,
 __
 Jeffery Wang
 Application Service - Backend
 Morningstar (Shenzhen) Ltd.
 Morningstar. Illuminating investing worldwide.
 +86 755 3311 0220 Office
 +86 130 7782 2813 Mobile
 jeffery.w...@morningstar.commailto:jeffery.w...@morningstar.com
 This e-mail contains privileged and confidential information and is intended 
 only for the use of the person(s) named above. Any dissemination, 
 distribution or duplication of this communication without prior written 
 consent from Morningstar is strictly prohibited. If you received this message 
 in error please contact the sender immediately and delete the materials from 
 any computer.



Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
The default is 2.0, and higher values will more strongly favor merging
segments with deletes.

I think 20.0 is likely way too high ... maybe try 3-5?
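
(For reference, a hedged Lucene-level sketch of that tuning -- it assumes an
existing IndexWriterConfig "iwc", and mirrors the solrconfig.xml snippet
quoted below with the weight dialed back:)

  TieredMergePolicy tmp = new TieredMergePolicy();
  tmp.setMaxMergeAtOnce(20);
  tmp.setSegmentsPerTier(8);
  tmp.setReclaimDeletesWeight(3.0);  // default is 2.0; higher favors segments with deletes
  iwc.setMergePolicy(tmp);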


Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to do 
 optimizes on our continuously updated index in solr3.6.1 and I came across 
 the mention of the reclaimDeletesWeight setting in this blog: 
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department



Re: TieredMergePolicy reclaimDeletesWeight

2013-06-19 Thread Michael McCandless
Way too high would cause it to pick highly lopsided merges just
because a few deletes were removed.

Highly lopsided merges (e.g. one big segment and N tiny segments) can
be horrible because it can lead to O(N^2) merge cost over time.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 19, 2013 at 1:36 PM, Petersen, Robert
robert.peter...@mail.rakuten.com wrote:
 OK thanks, will do.  Just out of curiosity, what would having that set way 
 too high do?  Would the index become fragmented or what?

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Wednesday, June 19, 2013 9:33 AM
 To: solr-user@lucene.apache.org
 Subject: Re: TieredMergePolicy reclaimDeletesWeight

 The default is 2.0, and higher values will more strongly favor merging 
 segments with deletes.

 I think 20.0 is likely way too high ... maybe try 3-5?


 Mike McCandless

 http://blog.mikemccandless.com


 On Tue, Jun 18, 2013 at 6:46 PM, Petersen, Robert 
 robert.peter...@mail.rakuten.com wrote:
 Hi

 In continuing a previous conversation, I am attempting to not have to
 do optimizes on our continuously updated index in solr3.6.1 and I came
 across the mention of the reclaimDeletesWeight setting in this blog:
 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-mer
 ges.html

 We do a *lot* of deletes in our index so I want to make the merges be more 
 aggressive on reclaiming deletes, but I am having trouble finding much out 
 about this setting.  Does anyone have experience with this setting?  Would 
 the below accomplish what I want ie for it to go after deletes more 
 aggressively than normal?  I got the impression 10.0 was the default from 
 looking at this code but I could be wrong:
 https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulB
 uild/clover-report/org/apache/lucene/index/TieredMergePolicy.html?id=3
 085

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">20</int>
   <int name="segmentsPerTier">8</int>
   <double name="reclaimDeletesWeight">20.0</double>
 </mergePolicy>

 Thanks

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department





Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-15 Thread Michael McCandless
You could also try the new[ish] PostingsHighlighter:
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
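
(A rough sketch of the Lucene-level API, assuming the field was indexed with
offsets -- IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS -- and is
stored; "searcher" and "query" are assumed to already exist, and the field
name is illustrative:)

  PostingsHighlighter highlighter = new PostingsHighlighter();
  TopDocs topDocs = searcher.search(query, 10);
  // One formatted snippet per document in topDocs:
  String[] snippets = highlighter.highlight("content", query, searcher, topDocs);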

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 If you have very large documents (many MB) that can lead to slow
 highlighting, even with FVH.

 See https://issues.apache.org/jira/browse/LUCENE-3234

 and try setting phraseLimit=1 (or some bigger number, but not infinite,
 which is the default)

 -Mike



 On 6/14/13 4:52 PM, Andy Brown wrote:

 Bryan,

 For specifics, I'll refer you back to my original email where I
 specified all the fields/field types/handlers I use. Here's a general
 overview.
   I really only have 3 fields that I index and search against: name,
 description, and content. All of which are just general text
 (string) fields. I have a catch-all field called text that is only
 used for querying. It's indexed but not stored. The name,
 description, and content fields are copied into the text field.
   For partial word matching, I have 4 more fields: name_par,
 description_par, content_par, and text_par. The text_par field
 has the same relationship to the *_par fields as text does to the
 others (only used for querying). Those partial word matching fields are
 of type text_general_partial which I created. That field type is
 analyzed different than the regular text field in that it goes through
 an EdgeNGramFilterFactory with the minGramSize=2 and maxGramSize=7
 at index time.
   I query against both text and text_par fields using edismax deftype
 with my qf set to text^2 text_par^1 to give full word matches a higher
 score. This part returns back very fast as previously stated. It's when
 I turn on highlighting that I take the huge performance hit.
   Again, I'm using the FastVectorHighlighting. The hl.fl is set to name
 name_par description description_par content content_par so that it
 returns highlights for full and partial word matches. All of those
 fields have indexed, stored, termPositions, termVectors, and termOffsets
 set to true.
   It all seems redundant just to allow for partial word
 matching/highlighting but I didn't know of a better way. Does anything
 stand out to you that could be the culprit? Let me know if you need any
 more clarification.
   Thanks!
   - Andy

 -Original Message-
 From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
 Sent: Wednesday, May 29, 2013 5:44 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Slow Highlighter Performance Even Using
 FastVectorHighlighter

 Andy,

 I don't understand why it's taking 7 secs to return highlights. The

 size

 of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set

 to

 1024 for this verification purpose and that should be more than

 enough.

 The processor is plenty powerful enough as well.

 Running VisualVM shows all my CPU time being taken by mainly these 3
 methods:


 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseI

 nfo.getStartOffset()

 org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseI

 nfo.getStartOffset()

 org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap(

 )

 That is a strange and interesting set of things to be spending most of
 your CPU time on. The implication, I think, is that the number of term
 matches in the document for terms in your query (or, at least, terms
 matching exact words or the beginning of phrases in your query) is
 extremely high . Perhaps that's coming from this partial word match
 you
 mention -- how does that work?

 -- Bryan

 My guess is that this has something to do with how I'm handling

 partial

 word matches/highlighting. I have setup another request handler that
 only searches the whole word fields and it returns in 850 ms with
 highlighting.

 Any ideas?

 - Andy


 -Original Message-
 From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
 Sent: Monday, May 20, 2013 1:39 PM
 To: solr-user@lucene.apache.org
 Subject: RE: Slow Highlighter Performance Even Using
 FastVectorHighlighter

 My guess is that the problem is those 200M documents.
 FastVectorHighlighter is fast at deciding whether a match, especially

 a

 phrase, appears in a document, but it still starts out by walking the
 entire list of term vectors, and ends by breaking the document into
 candidate-snippet fragments, both processes that are proportional to

 the

 length of the document.

 It's hard to do much about the first, but for the second you could
 choose
 to expose FastVectorHighlighter's FieldPhraseList representation, and
 return offsets to the caller rather than fragments, building up your

 own

 snippets from a separate store of indexed files. This would also

 permit

 you to set stored=false, improving your memory/core size ratio,

 which

 I'm guessing could use some improving. It would require some work, and
 it
 would require you to store a 

Re: How to recover from Error opening new searcher when machine crashed while indexing

2013-05-01 Thread Michael McCandless
Alas I think CheckIndex can't do much here: there is no segments file,
so you'll have to reindex from scratch.

Just to check: did you ever called commit while building the index
before the machine crashed?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 30, 2013 at 8:17 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

 Try running the CheckIndex tool.

 Otis
 Solr  ElasticSearch Support
 http://sematext.com/
 On Apr 30, 2013 3:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Solr 4.0 was indexing data and the machine crashed.

 Any suggestions on how to recover my index since I don't want to delete my
 data directory?

 When I try to start it again, I get this error:
 ERROR 12:01:46,493 Failed to load Solr core: xyz.index1
 ERROR 12:01:46,493 Cause:
 ERROR 12:01:46,494 Error opening new searcher
 org.apache.solr.common.SolrException: Error opening new searcher
 at org.apache.solr.core.SolrCore.init(SolrCore.java:701)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:564)
 at

 org.apache.solr.core.CassandraCoreContainer.load(CassandraCoreContainer.java:213)
 at
 com.datastax.bdp.plugin.SolrCorePlugin.activateImpl(SolrCorePlugin.java:66)
 at

 com.datastax.bdp.plugin.PluginManager$PluginInitializer.call(PluginManager.java:161)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: org.apache.solr.common.SolrException: Error opening new searcher
 at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1290)
 at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1402)
 at org.apache.solr.core.SolrCore.init(SolrCore.java:675)
 ... 9 more
 Caused by: org.apache.lucene.index.IndexNotFoundException: no segments*
 file found in NRTCachingDirectory(org.apache.lucene.store.NIOFSDirectory@
 /media/SSD/data/solr.data/rlcatalogks.prodinfo/index
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@d7581b;
 maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_73ne_nrm.cfs,
 _73ng_Lucene40_0.tip, _73nh_nrm.cfs, _73ng_Lucene40_0.tim, _73nf.fnm,
 _73n5_Lucene40_0.frq, _73ne.fdt, _73nh.fdx, _73ne_nrm.cfe, _73ne.fdx,
 _73ne_Lucene40_0.tim, _73ne.si, _73ni.fnm, _73nh_Lucene40_0.prx,
 _73ni.fdt,
 _73n5.si, _73ne_Lucene40_0.tip, _73nf_Lucene40_0.frq,
 _73nf_Lucene40_0.prx,
 _73nf_nrm.cfe, _73ne_Lucene40_0.frq, _73ng_Lucene40_0.prx,
 _73nf_Lucene40_0.tip, _73n5.fdx, _73ng_Lucene40_0.frq, _73ng.fnm,
 _73ni.fdx, _73n5.fnm, _73nf_Lucene40_0.tim, _73ni.si, _73n5.fdt,
 _73nf_nrm.cfs, _73nh_nrm.cfe, _73ni_Lucene40_0.frq, _73ng.fdx,
 _73ne_Lucene40_0.prx, _73nh.fnm, _73nh_Lucene40_0.tip,
 _73nh_Lucene40_0.tim, _73nh.si, _73n5_Lucene40_0.tip,
 _73ni_Lucene40_0.prx,
 _73n5_Lucene40_0.tim, _73nf.si, _73ng_nrm.cfe, _73n5_Lucene40_0.prx,
 _392j_42f.del, _73ng.fdt, _73ng.si, _73ni_nrm.cfe, _73n5_nrm.cfe,
 _73ni_nrm.cfs, _73nf.fdx, _73ni_Lucene40_0.tip, _73n5_nrm.cfs,
 _73ni_Lucene40_0.tim, _73nf.fdt, _73ne.fnm, _73nh.fdt,
 _73nh_Lucene40_0.frq, _73ng_nrm.cfs]


 --
 Thanks,
 -Utkarsh



Re: Bloom filters and optimized vs. unoptimized indices

2013-04-30 Thread Michael McCandless
Be sure to test the bloom postings format on your own use case ... in
my tests (heavy PK lookups) it was slower.

But to answer your question: I would expect a single segment index to
have much faster PK lookups than a multi-segment one, with and without
the bloom postings format, but bloom may make the many-segment case
faster (just be sure to test it yourself).


Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 30, 2013 at 1:05 AM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi,

 I was looking at
 http://lucene.apache.org/core/4_2_1/codecs/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.html
 and this piece of text:
 
 A PostingsFormat useful for low doc-frequency fields such as primary
 keys. Bloom filters are maintained in a .blm file which offers
 fast-fail for reads in segments known to have no record of the key.
 

 Is this implying that if you are doing PK lookups AND you have a large
 index (i.e. slow queries) it may actually be better to keep the index
 unoptimized, so whole index segments can be skipped?

 Thanks,
 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm/index.html


Re: Document adds, deletes, and commits ... a question about visibility.

2013-04-15 Thread Michael McCandless
At the Lucene level, you don't have to commit before doing the
deleteByQuery, i.e. 'a' will be correctly deleted without any
intervening commit.
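
(To make that concrete, a hedged Lucene-level sketch -- "writer" is an open
IndexWriter, and docA/docB carry an "id" field with values "a"/"b":)

  writer.addDocument(docA);   // id = "a"
  writer.addDocument(docB);   // id = "b"
  // No commit needed in between: the delete still applies to "a".
  writer.deleteDocuments(new TermQuery(new Term("id", "a")));
  writer.commit();            // after this, only "b" is in the index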

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 15, 2013 at 3:57 PM, Shawn Heisey s...@elyograg.org wrote:
 Simple question first: Is there anything in SolrJ that prevents indexing
 more than 500 documents in one request? I'm not aware of anything myself,
 but a co-worker remembers running into something, so his code is restricting
 them to 490 docs.  The only related limit I'm aware of is the POST buffer
 size limit, which defaults in recent Solr versions to 2MiB.

 A more complex question: If I am doing both deletes and adds in separate
 update requests, and I want to ensure that a delete in the next request can
 delete a document that I am adding in the current one, do I need to commit
 between the two requests?  This is probably more of a Lucene question than
 Solr, but Solr is what I'm using.

 To simplify:  Let's say I start with an empty index.  I add documents a
 and b in one request ... then I send a deleteByQuery request for a c
 and e.  If I don't do a commit between these two requests, will a still
 be in the index when I commit after the second request? If so, would there
 be an easy fix?

 Thanks,
 Shawn


Re: Is Lucene's DrillSideways something suitable for Solr?

2013-03-13 Thread Michael McCandless
On Tue, Mar 12, 2013 at 11:24 PM, Yonik Seeley yo...@lucidworks.com wrote:
 On Tue, Mar 12, 2013 at 10:27 PM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
 Lucene seems to get a new DrillSideways functionality on top of its own
 facet implementation.

 I would love to have something like that in Solr

 Solr has had multi-select faceting for 4 years now.
 My understanding of DrillSideways is that it implements the same type
 of thing for Lucene faceting module (which Solr doesn't use).

 There's implementation (which DrillSideways is), and interface (which
 for Solr means tagging / excluding filters).
 If you have any ideas around improving the Solr interface for
 multi-select faceting, please share them!

Actually DrillSideways is independent of multi-select.

Ie, it's useful to have the sideways counts for a drill-down field,
whether your UI offers single or multi select for a given dimension.

DrillSideways.java is a different implementation (minShouldMatch=N-1
query w/ custom collector to separate hit from near-miss) than Solr
(tagging/excluding filters), and also a different interface.

Mike McCandless

http://blog.mikemccandless.com


Re: AW: 170G index, 1.5 billion documents, out of memory on query

2013-02-26 Thread Michael McCandless
It really should be unlimited: this setting has nothing to do with how
much RAM is on the computer.

See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 26, 2013 at 12:18 PM, zqzuk ziqizh...@hotmail.co.uk wrote:
 Hi
 sorry I couldnt do this directly... the way I do this is by subscribing to a
 cluster of computers in our organisation and send the job with required
 memory. It gets randomly allocated to a node (one single server in the
 cluster) once executed and it is not possible to connect to that specific
 node to check.

 But I'm pretty sure it won't be unlimited but will match the figure I
 required, which was 40G (the max memory on a single node is 48G anyway). So
 Solr only gets a maximum of 40G of memory for this index.





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/170G-index-1-5-billion-documents-out-of-memory-on-query-tp4042696p4043110.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: AnalyzingSuggester returning index value instead of field value?

2013-02-07 Thread Michael McCandless
I'm not very familiar with how AnalyzingSuggester works inside Solr
... if you try this directly with the Lucene APIs does it still
happen?

Hmm maybe one idea: if you remove whitespace from your suggestion does
it work?  I wonder if there's a whitespace / multi-token issue ... if
so then maybe see how TestPhraseSuggestions.java (in Solr) does this?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Feb 7, 2013 at 9:48 AM, Sebastian Saip sebastian.s...@gmail.com wrote:
 I'm looking into a way to implement an autosuggest and for my special needs
 (I'm doing a startsWith-search that should retrieve the full name, which
 may have accents - However, I want to search with/without accents and in
 any upper/lowercase for comfort)

 Here's part of my configuration: http://pastebin.com/20vSGJ1a

 So I have a name=Têst Námè and I query for test, tést, TÈST, or
 similar. This gives me back test name as a suggestion, which looks like
 the index, rather than the actual value.

 Furthermore, when I fed the document without index-analyzers, then added
 the index-analyzers, restarted without refeeding and queried, it returned
 the right value (so this seems to retrieve the index, rather than the
 actual stored value?)

 Or maybe I just configured it the wrong way :?
 There's not really much documentation about this yet :(

 BR Sebastian Saip


Re: get a list of terms sorted by total term frequency

2012-11-07 Thread Michael McCandless
Lucene's misc module has HighFreqTerms tool.
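
(If you'd rather do it in code, here is a hedged sketch using the Lucene 4.x
APIs -- iterate the field's terms and sort by totalTermFreq(); the class and
field names are made up:)

  import java.io.IOException;
  import java.util.*;
  import org.apache.lucene.index.*;
  import org.apache.lucene.util.BytesRef;

  public class TermTotals {
    // Returns (term, totalTermFreq) pairs for one field, highest count first.
    // totalTermFreq() is -1 if the field was indexed without term freqs.
    public static List<Map.Entry<String,Long>> totals(IndexReader reader, String field)
        throws IOException {
      Map<String,Long> counts = new HashMap<String,Long>();
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms != null) {
        TermsEnum te = terms.iterator(null);
        BytesRef term;
        while ((term = te.next()) != null) {
          counts.put(term.utf8ToString(), te.totalTermFreq());
        }
      }
      List<Map.Entry<String,Long>> sorted =
          new ArrayList<Map.Entry<String,Long>>(counts.entrySet());
      Collections.sort(sorted, new Comparator<Map.Entry<String,Long>>() {
        public int compare(Map.Entry<String,Long> a, Map.Entry<String,Long> b) {
          return b.getValue().compareTo(a.getValue());
        }
      });
      return sorted;
    }
  }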

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett heacu.mcint...@gmail.com wrote:
 hi,

 is there a simple way to get a list of all terms that occur in a field
 sorted by their total term frequency within that field?

 TermsComponent (http://wiki.apache.org/solr/TermsComponent) provides
 fast field faceting over the whole index, but as counts it gives the
 number of documents that each term occurs in (given a field or set of
 fields). in place of document counts, i want total term frequency
 counts. the ttf function
 (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides
 this, but only if you know what term to pass to the function.

 edward


Re: throttle segment merging

2012-10-29 Thread Michael McCandless
With Lucene 4.0, FSDirectory now supports merge bytes/sec throttling
(FSDirectory.setMaxMergeWriteMBPerSec): it rate limits that max
bytes/sec load on the IO system due to merging.
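
(Roughly, at the Lucene level -- a sketch only, since the exact home of this
API has moved between 4.x releases; the 20 MB/sec value is illustrative:)

  FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
  dir.setMaxMergeWriteMBPerSec(20.0);  // cap the IO load merges may generate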

Not sure if it's been exposed in Solr / ElasticSearch yet ...

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 29, 2012 at 7:07 AM, Tomás Fernández Löbbe
tomasflo...@gmail.com wrote:

  Is there way to set-up logging to output something when segment merging
 runs?

  I think segment merging is logged when you enable infoStream logging
 (you
 should see it commented in the solrconfig.xml)

 no, segment merging is not logged at info level. it needs customized log
 config.


 INFO level is not the same as infoStream. See solrconfig, there is a
 commented section that talks about it, and if you uncomment it it will
 generate a file with low level Lucene logging. This file will include
 segments information, including merging.




  Can be segment merges throttled?

  You can change when and how segments are merged with the merge policy,
 maybe it's enough for you changing the initial settings (mergeFactor for
 example)?

 I am now researching elasticsearch, it can do it, its lucene 3.6 based



 I don't know if this is what you are looking for, but the TieredMergePolicy
 (default) allows you to set maximum number of segments to be merged at once
 and maximum size of segments to be created during normal merging.
 Other option is, as you said, create a Jira for a new merge policy.

 Tomás


Re: Indexing in Solr: invalid UTF-8

2012-09-26 Thread Michael McCandless
Python's unicode function takes an optional (keyword) errors
argument, telling it what to do when an invalid UTF8 byte sequence is
seen.

The default (errors='strict') is to throw the exceptions you're
seeing.  But you can also pass errors='replace' or errors='ignore'.

See http://docs.python.org/howto/unicode.html for details ...
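
(For what it's worth, the same idea on the Java side -- a hedged sketch of
decoding bytes as UTF-8 while replacing, rather than rejecting, invalid
sequences; "rawBytes" is assumed to hold the extracted text:)

  import java.nio.ByteBuffer;
  import java.nio.charset.CharsetDecoder;
  import java.nio.charset.CodingErrorAction;
  import java.nio.charset.StandardCharsets;

  // Invalid byte sequences become U+FFFD instead of raising an exception,
  // like Python's errors='replace'. (decode() declares CharacterCodingException.)
  CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE);
  String text = decoder.decode(ByteBuffer.wrap(rawBytes)).toString();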

However, I agree with Robert: you should dig into why whatever process
you used to extract the full text from your binary documents is
producing invalid UTF-8 ... something is wrong with that process.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 25, 2012 at 10:44 PM, Robert Muir rcm...@gmail.com wrote:
 On Tue, Sep 25, 2012 at 2:02 PM, Patrick Oliver Glauner
 patrick.oliver.glau...@cern.ch wrote:
 Hi
 Thanks. But I see that 0xd835 is missing in this list (see my exceptions).

 What's the best way to get rid of all of them in Python? I am new to unicode 
 in Python but I am sure that this use case is quite frequent.


 I don't really know python either: so I could be wrong here but are
 you just taking these binary .PDF and .DOC files and treating them as
 UTF-8 text and sending them to Solr?

 If so, I don't think that will work very well. Maybe instead try
 parsing these binary files with something like Tika to get at the
 actual content and send that? (it seems some people have developed
 python integration for this, e.g.
 http://redmine.djity.net/projects/pythontika/wiki)


Re: offsets issues with multiword synonyms since LUCENE_33

2012-08-14 Thread Michael McCandless
See also SOLR-3390.

Some cases have been addressed.  Eg, if you match domain name system
- dns, then dns will have correct offsets spanning the full phrase
domain name system in the input.  (However: QueryParser won't work
because a query for domain name system is pre-split on whitespace so
the synonym never matches).

But for the reverse case, which I call expanding (ie, match dns -
domain name system), the results are not correct (or at least
different from the previous SynFilter impl): the three tokens are
overlapped onto subsequent tokens, resulting in highlighting the wrong
tokens. However, QueryParser will work correctly for the query
domain name system...

But, I'd like to ask: why do apps want to expand (replace a match
with more than one input token, ie the dns - domain name system
case)?  Is it ONLY because of QueryParser's limitation (that it
pre-splits on whitespace)?  Or are there other realistic use cases?

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 11:53 AM, Marc Sturlese marc.sturl...@gmail.com wrote:
 Has someone noticed this problem and solved it somehow? (without using
 LUCENE_33 in the solrconfig.xml)
 https://issues.apache.org/jira/browse/LUCENE-3668

 Thanks in advance



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/offsets-issues-with-multiword-synonyms-since-LUCENE-33-tp4001195.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: synonym file

2012-08-03 Thread Michael McCandless
Actually FST (and SynFilter based on it) was backported to 3.x.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Aug 3, 2012 at 11:28 AM, Jack Krupansky j...@basetechnology.com wrote:
 The Lucene FST guys made a big improvement in synonym filtering in
 Lucene/Solr 4.0 using FSTs. Or are you already using that?

 Or if you are stuck with pre-4.0, you could do a preprocessor that
 efficiently generates boolean queries for the synonym expansions. That
 should give you more decent query times, assuming you develop a decent
 synonym lookup filter.

 Maybe you could backport the 4.0 FST code, or at least use the same
 techniques for your own preprocessor.

 -- Jack Krupansky

 -Original Message- From: Peyman Faratin
 Sent: Friday, August 03, 2012 12:56 AM
 To: solr-user@lucene.apache.org
 Subject: synonym file


 Hi

 I have a (23M) synonym file that takes a long time (3 or so minutes) to load
 and once included seems to adversely affect the QTime of the application by
 approximately 4 orders of magnitude.

 Any advise on how to load faster and lower the QT would be much appreciated.

 best

 Peyman=


Re: Near Real Time Indexing and Searching with solr 3.6

2012-07-03 Thread Michael McCandless
Hi,

You might want to take a look at Solr's trunk (very soon to be 4.0.0
alpha release), which already has a near-real-time solution (using
Lucene's near-real-time APIs).

Lucene has NRTCachingDirectory (to use RAM for small / recently
flushed segments), but I don't think Solr uses it yet.
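
(A hedged sketch of that combination at the Lucene level -- the two numbers
are the max cached merge-segment size and the total RAM cache size in MB,
and "iwc" is an existing IndexWriterConfig:)

  Directory fsDir = FSDirectory.open(new File("/path/to/index"));
  NRTCachingDirectory cachedDir = new NRTCachingDirectory(fsDir, 5.0, 60.0);
  IndexWriter writer = new IndexWriter(cachedDir, iwc);
  // Near-real-time reader: sees uncommitted adds/deletes without a full commit.
  DirectoryReader reader = DirectoryReader.open(writer, true);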

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 3, 2012 at 4:02 AM, thomas tho...@codemium.com wrote:
 Hi,

 As part of my bachelor thesis I'm trying to achieve NRT with Solr 3.6. I've
 come up with a basic concept and would be thrilled if I could get some
 feedback.

 The main idea is to use two different Indexes. One persistent on disc and
 one in RAM. The plan is to route every added and modified document to the
 RAMIndex (http://imgur.com/kLfUN). After a certain period of time, this
 index would get cleared and the documents get added to the persistent Index.

 Some major problems I still have with this idea is:
 - deletions of documents that are already in the persistent index
 - having the same unique IDs in both the RAM index and persitent Index, as a
 result of an updated document
   - Merging search results to filter out old versions of updated documents

 Would such an idea be viable to persuit?

 Thanks for you time



Re: leap second bug

2012-07-01 Thread Michael McCandless
Looks like this is a low-level Linux issue ... see Shay's email to the
ElasticSearch list about it:

https://groups.google.com/forum/?fromgroups#!topic/elasticsearch/_I1_OfaL7QY

Also see the comments here:

 http://news.ycombinator.com/item?id=4182642

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jul 1, 2012 at 8:08 AM, Óscar Marín Miró
oscarmarinm...@gmail.com wrote:
 Hello Michael, thanks for the note :)

 I'm having a similar problem since yesterday; the tomcats are wild on CPU [near
 100%]. Did your solr servers stop replying to index/query requests?

 Thanks :)

 On Sun, Jul 1, 2012 at 1:22 PM, Michael Tsadikov 
 mich...@myheritage.comwrote:

 Our solr servers went into GC hell, and became non-responsive on date
 change today.

 Restarting tomcats did not help.

 Rebooting the machine did.


 http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/




 --
 Whether it's science, technology, personal experience, true love,
 astrology, or gut feelings, each of us has confidence in something that we
 will never fully comprehend.
  --Roy H. William


Re: Exception when optimizing index

2012-06-18 Thread Michael McCandless
Is it possible the Linux machine has bad RAM / bad disk?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 18, 2012 at 7:06 AM, Erick Erickson erickerick...@gmail.com wrote:
 Is it possible that you somehow have some problem with jars and classpath?
 I'm wondering because this problem really seems odd, and you've eliminated
 a bunch of possibilities. I'm wondering if you've somehow gotten some old
 jars mixed in the bunch.

 Or, alternately, what about re-installing Solr on the theory that somehow you
 got a bad download and/or files (i.e. the Solr jar files) got
 corrupted, your disk has
 a bad spot or.

 Really clutching at straws here

 Erick

 On Mon, Jun 18, 2012 at 3:44 AM, Rok Rejc rokrej...@gmail.com wrote:
 Hi all,

 during the last days, I have create solr instance on a windows environment
 - same Solr as on the linux machine (solr 4.0 from 9th June 2012), same
 solr configurations, Tomcat 6, Java 6u23.
 I have also upgraded Java on the linux machine (1.7.0_05-b05 from Oracle).

 Import and optimize on the windows machine worked without any issue, but on
 the linux machine optimize fails with the same exception:

 Caused by: java.io.IOException: Invalid vInt detected (too many bits)
    at
 org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:217)
 ...

 after that I have also change directory factory (on the linux machine) to
 SimpleFSDirectoryFactory. I have reindexed all the documents and again run
 the optimize - it fails again with the same expcetion.

 In the next steps I could maybe do partial insertions (which will be a
 painful process), but after that I'm out of ideas (and out of time for
 experimenting).

 Many thanks for further suggestions.

 Rok



 On Wed, Jun 13, 2012 at 1:31 PM, Robert Muir rcm...@gmail.com wrote:

 On Thu, Jun 7, 2012 at 5:50 AM, Rok Rejc rokrej...@gmail.com wrote:
    - java.runtime.nameOpenJDK Runtime Environment
    - java.runtime.version1.6.0_22-b22
 ...
 
  As far as I see from the JIRA issue I have the patch attached (as
 mentioned
  I have a trunk version from May 12). Any ideas?
 

 its not guaranteed that the patch will workaround all hotspot bugs
 related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921

 Since you can reproduce, is it possible for you to re-test the
 scenario with a newer JVM (e.g. 1.7.0_04) just to rule that out?

 --
 lucidimagination.com



Re: field name was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
This behavior has changed.

In 3.x, you silently got no results in such cases.

In trunk, you get an exception notifying you that the query cannot run.

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 What is the intended behaviour for explicit phrase queries on fields without 
 position data? If a (e)dismax qf parameter included a field 
 omitTermFreqAndPositions=true user explicit phrase queries throw the 
 following error on trunk but not on the 3x branch.

 java.lang.IllegalStateException: field name was indexed without position 
 data; cannot run PhraseQuery (term=a)
        at 
 org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
        at 
 org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
        at 
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
        at 
 org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
        at 
 org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
        at 
 org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
        at 
 org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
        at 
 org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
        at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
        at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
        at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
        at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
        at 
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Thread.java:662)


 Thanks


Re: field name was indexed without position data; cannot run PhraseQuery (term=a)

2012-05-24 Thread Michael McCandless
I believe termPositions=false refers to the term vectors and not how
the field is indexed (which is very confusing I think...).

I think you'll need to index a separate field (with term freqs + positions
disabled) from the field the queryparser can query?

But ... if all of this is to just do custom scoring ... can't you just
set a custom similarity for the field and index it normally (with term
freq + positions).
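
(For the separate-field route, a hedged sketch of a Lucene 4.x FieldType
indexed with docs only -- no freqs or positions -- which is what
omitTermFreqAndPositions=true amounts to; the field name is made up:)

  FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
  ft.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);  // docIDs only
  ft.freeze();
  // Phrase queries against this field are impossible by design.
  doc.add(new Field("name_docs_only", value, ft));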

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 24, 2012 at 6:47 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Thanks!

 How can we, in that case, omit term frequency for a qf field? I assume the 
 way to go is to configure a custom flat term frequency similarity for that 
 field. And how can it be that this error is not thrown with 
 termPosition=false for that field but only with omitTermFreqAndPositions?

 Markus


 -Original message-
 From:Michael McCandless luc...@mikemccandless.com
 Sent: Thu 24-May-2012 12:26
 To: solr-user@lucene.apache.org
 Subject: Re: field quot;namequot; was indexed without position data; 
 cannot run PhraseQuery (term=a)

 This behavior has changed.

 In 3.x, you silently got no results in such cases.

 In trunk, you get an exception notifying you that the query cannot run.

 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, May 24, 2012 at 6:04 AM, Markus Jelsma
 markus.jel...@openindex.io wrote:
  Hi,
 
  What is the intended behaviour for explicit phrase queries on fields 
  without position data? If a (e)dismax qf parameter included a field 
  omitTermFreqAndPositions=true user explicit phrase queries throw the 
  following error on trunk but not on the 3x branch.
 
  java.lang.IllegalStateException: field name was indexed without position 
  data; cannot run PhraseQuery (term=a)
         at 
  org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer(PhraseQuery.java:274)
         at 
  org.apache.lucene.search.DisjunctionMaxQuery$DisjunctionMaxWeight.scorer(DisjunctionMaxQuery.java:160)
         at 
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:589)
         at 
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:280)
         at 
  org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1518)
         at 
  org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1265)
         at 
  org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:384)
         at 
  org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:411)
         at 
  org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204)
         at 
  org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1555)
         at 
  org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
         at 
  org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
         at 
  org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
         at 
  org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
         at 
  org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
         at 
  org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
         at 
  org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
         at 
  org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
         at 
  org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
         at 
  org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
         at 
  org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
         at 
  org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
         at 
  org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
         at 
  org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
         at 
  org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
         at org.eclipse.jetty.server.Server.handle(Server.java:351)
         at 
  org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
         at 
  org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
         at 
  org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
         at 
  org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
         at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
         at 
  org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
         at 
  

Re: question about NRT(soft commit) and Transaction Log in trunk

2012-05-06 Thread Michael McCandless
This is a good question...

I don't know much about how Solr's transaction log works, but, peeking
in the code, I do see it fsync'ing (look in TransactionLog.java, in
the finish method), but only if the SyncLevel is FSYNC.

If the default is really flush, I don't see how the transaction log
helps on recovery...?

Should we change the default ot FSYNC?

Mike McCandless

http://blog.mikemccandless.com


On Sat, Apr 28, 2012 at 7:11 AM, Li Li fancye...@gmail.com wrote:
 hi
   I checked out the trunk and played with its new soft commit
 feature. it's cool. But I've got a few questions about it.
   By reading some introductory articles and the wiki, and doing a hasty code
 reading, my understanding of its implementation is:
   For a normal commit (hard commit), we have to flush everything to disk and
 commit it. The flush is not very time consuming because of the
 os-level cache; the most time-consuming part is the sync in the commit process.
   A soft commit just flushes postings and pending deletions to disk,
 generating new segments. Then solr can use a
 new searcher to read the latest indexes, warm up, and then register itself.
   If there is no hard commit and the jvm crashes, then new data may be lost.
   If my understanding is correct, then why do we need the transaction log?
   I found in DirectUpdateHandler2, every time a command is executed,
 TransactionLog will record a line in log. But the default
 sync level in RunUpdateProcessorFactory is flush, which means it will
 not sync the log file. does this make sense?
   In database implementations, we usually write the log and modify data
 in memory because the log is smaller than the real data. If it crashes,
 we can redo the unfinished log and make the data correct. Will Solr
 leverage this log like that? If so, why isn't it synced?


Re: SOLR 3.5 Index Optimization not producing single .cfs file

2012-05-03 Thread Michael McCandless
By default, the default merge policy (TieredMergePolicy) won't create
the CFS if the segment is very large ( 10% of the total index
size).  Likely that's what you are seeing?

If you really must have a CFS (how come?) then you can call
TieredMergePolicy.setNoCFSRatio(1.0) -- not sure how/where this is
exposed in Solr though.
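
(At the Lucene level the call looks roughly like this -- a sketch, assuming
an existing IndexWriterConfig "iwc"; whether Solr 3.5 exposes these knobs in
solrconfig.xml is a separate question:)

  TieredMergePolicy tmp = new TieredMergePolicy();
  tmp.setUseCompoundFile(true);
  tmp.setNoCFSRatio(1.0);   // 1.0 = always build the .cfs, however large the segment
  iwc.setMergePolicy(tmp);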

LogMergePolicy also has the same behaviour/method...

Mike McCandless

http://blog.mikemccandless.com

On Thu, May 3, 2012 at 5:18 AM, pravesh suyalprav...@yahoo.com wrote:
 Hi,

 I've migrated the search servers to the latest stable release (SOLR-3.5)
 from SOLR-1.4.1.
 We've fully recreated the index for this. After index completes, when im
 optimizing the index then it is not merging the index into a single .cfs
 file as was being done with 1.4.1 version.

 We've set the <useCompoundFile>true</useCompoundFile>

 Is it something related to the new MergePolicy being used with SOLR 3.x
 onwards (I suppose it is TieredMergePolicy with 3.x version)? If yes should
 i change it to the LogByteSizeMergePolicy?

 Does this change requires complete rebuilt OR will do incrementally?


 Regards
 Pravesh



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLR-3-5-Index-Optimization-not-producing-single-cfs-file-tp3958619.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Large Index and OutOfMemoryError: Map failed

2012-04-22 Thread Michael McCandless
Is it possible you are hitting this (just opened) Solr issue?:

https://issues.apache.org/jira/browse/SOLR-3392

Mike McCandless

http://blog.mikemccandless.com

On Fri, Apr 20, 2012 at 9:33 AM, Gopal Patwa gopalpa...@gmail.com wrote:
 We cannot avoid auto soft commit, since we need Lucene NRT feature. And I
 use StreamingUpdateSolrServer for adding/updating index.

 On Thu, Apr 19, 2012 at 7:42 AM, Boon Low boon@brightsolid.com wrote:

 Hi,

 Also came across this error recently, while indexing with  10 DIH
 processes in parallel + default index setting. The JVM grinds to a halt and
 throws this error. Checking the index of a core reveals thousands of files!
 Tuning the default autocommit from 15000ms to 90ms solved the problem
 for us. (no 'autosoftcommit').

 Boon

 -
 Boon Low
 Search UX and Engine Developer
 brightsolid Online Publishing

 On 14 Apr 2012, at 17:40, Gopal Patwa wrote:

   I checked: MMapDirectory.UNMAP_SUPPORTED=true, and below is my
   system data. Is there any existing test case to reproduce this issue? I
   am trying to understand how I can reproduce this issue with a
   unit/integration test.
  
   I will try a recent Solr trunk build too; if it is some bug in Solr or
   Lucene keeping an old searcher open, then how do I reproduce it?
 
  SYSTEM DATA
  ===
  PROCESSOR: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
  SYSTEM ID: x86_64
  CURRENT CPU SPEED: 1600.000 MHz
  CPUS: 8 processor(s)
  MEMORY: 49449296 kB
  DISTRIBUTION: CentOS release 5.3 (Final)
  KERNEL NAME: 2.6.18-128.el5
  UPTIME: up 71 days
  LOAD AVERAGE: 1.42, 1.45, 1.53
  JBOSS Version: Implementation-Version: 4.2.2.GA (build:
  SVNTag=JBoss_4_2_2_GA date=20
  JAVA Version: java version 1.6.0_24
 
 
  On Thu, Apr 12, 2012 at 3:07 AM, Michael McCandless 
  luc...@mikemccandless.com wrote:
 
  Your largest index has 66 segments (690 files) ... biggish but not
  insane.  With 64K maps you should be able to have ~47 searchers open
  on each core.
 
  Enabling compound file format (not the opposite!) will mean fewer maps
  ... ie should improve this situation.
 
  I don't understand why Solr defaults to compound file off... that
  seems dangerous.
 
   Really we need a Solr dev here... to answer how long a stale
   searcher is kept open.  Is it somehow possible that 46 old searchers are
   being left open...?
 
  I don't see any other reason why you'd run out of maps.  Hmm, unless
  MMapDirectory didn't think it could safely invoke unmap in your JVM.
  Which exact JVM are you using?  If you can print the
  MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure.
 
  Yes, switching away from MMapDir will sidestep the too many maps
  issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if
  there really is a leak here (Solr not closing the old searchers or a
  Lucene bug or something...) then you'll eventually run out of file
  descriptors (ie, same  problem, different manifestation).
 
  Mike McCandless
 
  http://blog.mikemccandless.com
 
  2012/4/11 Gopal Patwa gopalpa...@gmail.com:
 
  I have not changed the mergeFactor; it was 10. Compound index files are
  disabled in my config, but I read in the post below that someone had a
  similar issue and it was resolved by switching from the compound index
  file format to non-compound index files.
 
  Some folks also resolved it by changing Lucene code to disable
  MMapDirectory. Is that a best practice, and if so, can it be done in
  configuration?
 
 
 
 http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
 
   I have indexed documents of core1 = 5 million, core2 = 8 million and
   core3 = 3 million, and all indexes are hosted in a single Solr instance.
  
   I am going to use Solr for our site StubHub.com; see the attached ls -l
   list of index files for all cores.
 
  SolrConfig.xml:
 
 
      <indexDefaults>
              <useCompoundFile>false</useCompoundFile>
              <mergeFactor>10</mergeFactor>
              <maxMergeDocs>2147483647</maxMergeDocs>
              <maxFieldLength>1</maxFieldLength>--
              <ramBufferSizeMB>4096</ramBufferSizeMB>
              <maxThreadStates>10</maxThreadStates>
              <writeLockTimeout>1000</writeLockTimeout>
              <commitLockTimeout>1</commitLockTimeout>
              <lockType>single</lockType>

          <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
            <double name="forceMergeDeletesPctAllowed">0.0</double>
            <double name="reclaimDeletesWeight">10.0</double>
          </mergePolicy>

          <deletionPolicy class="solr.SolrDeletionPolicy">
            <str name="keepOptimizedOnly">false</str>
            <str name="maxCommitsToKeep">0</str>
          </deletionPolicy>

      </indexDefaults>
 
 
      <updateHandler class="solr.DirectUpdateHandler2">
          <maxPendingDeletes>1000</maxPendingDeletes>
           <autoCommit>
             <maxTime>90</maxTime>
             <openSearcher>false</openSearcher>
           </autoCommit>
           <autoSoftCommit>
 
  maxTime

Re: Large Index and OutOfMemoryError: Map failed

2012-04-12 Thread Michael McCandless
 commit error...:java.io.IOException: Map
 failed
   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
   at
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(MMapDirectory.java:293)
   at 
 org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
   at
 org.apache.lucene.codecs.lucene40.Lucene40PostingsReader.init(Lucene40PostingsReader.java:58)
   at
 org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer(Lucene40PostingsFormat.java:80)
   at
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat(PerFieldPostingsFormat.java:189)
   at
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.init(PerFieldPostingsFormat.java:280)
   at
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.init(PerFieldPostingsFormat.java:186)
   at
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.init(PerFieldPostingsFormat.java:186)
   at
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:256)
   at
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentCoreReaders.java:108)
   at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:51)
   at
 org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader(IndexWriter.java:494)
   at
 org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:214)
   at
 org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2939)
   at
 org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2930)
   at 
 org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2681)
   at
 org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2804)
   at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2786)
   at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:391)
   at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   at

 ...

 [Message clipped]
 --
 From: Michael McCandless luc...@mikemccandless.com
 Date: Sat, Mar 31, 2012 at 3:15 AM
 To: solr-user@lucene.apache.org


 It's the virtual memory limit that matters; yours says unlimited below
 (good!), but, are you certain that's really the limit your Solr
 process runs with?

 On Linux, there is also a per-process map count:

    cat /proc/sys/vm/max_map_count

 I think it typically defaults to 65,536 but you should check on your
 env.  If a process tries to map more than this many regions, you'll
 hit that exception.

 I think you can:

  cat /proc/pid/maps | wc

 to see how many maps your Solr process currently has... if that is
 anywhere near the limit then it could be the cause.

 Mike McCandless

 http://blog.mikemccandless.com

 On Sat, Mar 31, 2012 at 1:26 AM, Gopal Patwa gopalpa...@gmail.com wrote:
 I need help!!

 I am using a Solr 4.0 nightly build with NRT and I often get this error
 during auto commit: java.lang.OutOfMemoryError: Map failed. I have
 searched this forum and what I found is that it is related to the OS ulimit
 settings; please see my ulimit settings below. I am not sure what ulimit
 settings I should have. We also get java.net.SocketException: Too many
 open files, and I am not sure how many open files we need to allow.

 I have 3 cores with index sizes: core1 - 70GB, core2 - 50GB and core3 -
 15GB, with a single shard.

 We update the index every 5 seconds, soft commit every 1 second and
 hard commit every 15 minutes.

 Environment: JBoss 4.2, JDK 1.6, CentOS, JVM heap size = 24GB

 ulimit:

 core file size          (blocks, -c) 0
 data seg size           (kbytes, -d) unlimited
 scheduling priority             (-e) 0
 file size               (blocks, -f) unlimited
 pending signals                 (-i) 401408
 max locked memory       (kbytes, -l) 1024
 max memory size         (kbytes, -m) unlimited
 open files                      (-n) 1024
 pipe size            (512 bytes, -p) 8
 POSIX message queues     (bytes, -q) 819200
 real-time priority              (-r) 0
 stack size              (kbytes, -s) 10240
 cpu time               (seconds, -t) unlimited
 max user processes              (-u) 401408
 virtual memory          (kbytes, -v) unlimited
 file locks                      (-x) unlimited


 *
 *

 *ERROR:*

 *
 *

 *2012-03-29* *15:14:08*,*560* [] *priority=ERROR* *app_name=*
 *thread=pool-3-thread-1* *location=CommitTracker* *line=93* *auto*
 *commit* *error...:java.io.IOException:* *Map* *failed*
        *at* *sun.nio.ch.FileChannelImpl.map*(*FileChannelImpl.java:748*)
        *at*
 *org.apache.lucene.store.MMapDirectory$MMapIndexInput.**init*(*MMapDirectory.java:293*)
        *at*
 *org.apache.lucene.store.MMapDirectory.openInput*(*MMapDirectory.java:221

Re: codecs for sorted indexes

2012-04-12 Thread Michael McCandless
Do you mean you are pre-sorting the documents (by what criteria?)
yourself, before adding them to the index?

In which case... you should already be seeing some benefits (smaller
index size) than had you randomly added them (ie the vInts should
take fewer bytes), I think.  (Probably the savings would be greater
for better intblock codecs like PForDelta, SimpleX, but I'm not
sure...).

Or do you mean having a codec re-sort the documents (on flush/merge)?
I think this should be possible w/ the Codec API... but nobody has
tried it yet that I know of.

Note that the bulkpostings branch is effectively dead (nobody is
iterating on it, and we've removed the old bulk API from trunk), but
there is likely a GSoC project to add a PForDelta codec to trunk:

https://issues.apache.org/jira/browse/LUCENE-3892

Mike McCandless

http://blog.mikemccandless.com



On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
c...@experienceon.com wrote:
 Hello,

 We're using a sorted index in order to implement early termination
 efficiently over an index of hundreds of millions of documents. As of now,
 we're using the default codecs coming with Lucene 4, but we believe that
 due to the fact that the docids are sorted, we should be able to do much
 better in terms of storage and achieve much better performance, especially
 decompression performance.

 In particular, Robert Muir is commenting on these lines here:

 https://issues.apache.org/jira/browse/LUCENE-2482?focusedCommentId=12982411page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12982411

 We're aware that the in the bulkpostings branch there are different codecs
 being implemented and different experiments being done. We don't know
 whether we should implement our own codec (i.e. using some RLE-like
 techniques) or we should use one of the codecs implemented there (PFOR,
 Simple64, ...).

 Can you please give us some advice on this?

 Thanks
 Carlos

 Carlos Gonzalez-Cadenas
 CEO, ExperienceOn - New generation search
 http://www.experienceon.com

 Mobile: +34 652 911 201
 Skype: carlosgonzalezcadenas
 LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


Re: Large Index and OutOfMemoryError: Map failed

2012-04-11 Thread Michael McCandless
Hi,

65K is already a very large number and should have been sufficient...

However: have you increased the merge factor?  Doing so increases the
open files (maps) required.

Have you disabled compound file format?  (Hmmm: I think Solr does so
by default... which is dangerous).  Maybe try enabling compound file
format?

Can you ls -l your index dir and post the results?

It's also possible Solr isn't closing the old searchers quickly enough
... I don't know the details on when Solr closes old searchers...

Mike McCandless

http://blog.mikemccandless.com



On Tue, Apr 10, 2012 at 11:35 PM, Gopal Patwa gopalpa...@gmail.com wrote:
  Michael, thanks for the response.

  It was 65K, as you mentioned, the default value for cat
  /proc/sys/vm/max_map_count. How do we determine what this value should be?
  Is it the number of documents per hard commit (in my case every 15 minutes),
  or the number of index files, or the number of documents we have in all cores?

  I have raised the number to 140K, but when it reaches 140K
  we have to restart the JBoss server to free up the map count; sometimes an OOM
  error happens during "Error opening new searcher".

  Is making this number unlimited the only solution?


 Error log:

 *location=CommitTracker line=93 auto commit
 error...:org.apache.solr.common.SolrException: Error opening new
 searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251)
        at 
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:409)
        at org.apache.solr.update.CommitTracker.run(CommitTracker.java:197)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)Caused by:
 java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
        at 
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(MMapDirectory.java:293)
        at 
 org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:221)
        at 
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.init(PerFieldPostingsFormat.java:262)
        at 
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$1.init(PerFieldPostingsFormat.java:316)
        at 
 org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.files(PerFieldPostingsFormat.java:316)
        at org.apache.lucene.codecs.Codec.files(Codec.java:56)
        at org.apache.lucene.index.SegmentInfo.files(SegmentInfo.java:423)
        at 
 org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:215)
        at 
 org.apache.lucene.index.IndexWriter.prepareFlushedSegment(IndexWriter.java:2220)
        at 
 org.apache.lucene.index.DocumentsWriter.publishFlushedSegment(DocumentsWriter.java:497)
        at 
 org.apache.lucene.index.DocumentsWriter.finishFlush(DocumentsWriter.java:477)
        at 
 org.apache.lucene.index.DocumentsWriterFlushQueue$SegmentFlushTicket.publish(DocumentsWriterFlushQueue.java:201)
        at 
 org.apache.lucene.index.DocumentsWriterFlushQueue.innerPurge(DocumentsWriterFlushQueue.java:119)
        at 
 org.apache.lucene.index.DocumentsWriterFlushQueue.tryPurge(DocumentsWriterFlushQueue.java:148)
        at 
 org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:438)
        at 
 org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:553)
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:354)
        at 
 org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:258)
        at 
 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:243)
        at 
 org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:250)
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1091)
        ... 11 moreCaused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)*



 And one more issue we came across i.e

 On Sat, Mar 31, 2012 at 3:15 AM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 It's the virtual memory limit that matters; yours says unlimited below
 (good!), but, are you certain that's really the limit your Solr
 process

Re: Virtual Memory very high

2012-04-02 Thread Michael McCandless
Are you seeing a real problem here, besides just being alarmed by the
big numbers from top?

Consumption of virtual memory by itself is basically harmless, as long
as you're not running up against any of the OS limits (and, you're
running a 64 bit JVM).

This is just top telling you that you've mapped large files into the
virtual memory space.

It's not telling you that you don't have any RAM left... virtual
memory is different from RAM.

In my tests, generally MMapDirectory gives faster search performance
than NIOFSDirectory... so unless there's an actual issue, I would
recommend sticking with MMapDirectory.
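If you do want to compare the two directly, the only Lucene-level difference
is which Directory implementation you open (a sketch against the 3.x API; the
index path is whatever your solrconfig points at):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class DirectoryChoice {
      public static void main(String[] args) throws Exception {
        File path = new File(args[0]);
        // mmap: index files are mapped into the process' address space, which is
        // what shows up as a huge VIRT number in top; the pages live in the OS cache
        Directory mmapDir = new MMapDirectory(path);
        // nio: plain positional reads, much smaller virtual size, usually somewhat slower
        Directory nioDir = new NIOFSDirectory(path);

        IndexReader reader = IndexReader.open(mmapDir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... run the same queries against either directory and compare latency ...
        searcher.close();
        reader.close();
      }
    }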

Mike McCandless

http://blog.mikemccandless.com

On Fri, Dec 9, 2011 at 11:54 PM, Rohit ro...@in-rev.com wrote:
 Hi All,



  I don't know if this question is directly related to this forum. I am running
  Solr in Tomcat on a Linux server. The moment I start Tomcat, the virtual memory
  shown by the top command goes to its maximum of 31.1G and then remains there.



  Is this the right behaviour? Why is the virtual memory usage so high? I have
  36GB of RAM on the server.



 Tasks: 309 total,   1 running, 308 sleeping,   0 stopped,   0 zombie

 Cpu(s): 19.1%us,  0.2%sy,  0.0%ni, 79.3%id,  1.2%wa,  0.0%hi,  0.2%si,
 0.0%st

 Mem:  49555260k total, 36152224k used, 13403036k free,   121612k buffers

 Swap:   999416k total,        0k used,   999416k free,  5409052k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 2741 mysql     20   0 6412m 5.8g 6380 S  182 12.3 108:07.45 mysqld

 2814 root      20   0 31.1g  22g 9716 S  100 46.6 375:51.70 java

 1765 root      20   0 12.2g 285m 9488 S    2  0.6   3:52.59 java

 3591 root      20   0 19352 1576 1068 R    0  0.0   0:00.24 top

    1 root      20   0 23684 1908 1276 S    0  0.0   0:06.21 init



 Regards,

 Rohit





Re: Open deleted index file failing jboss shutdown with Too many open files Error

2012-04-02 Thread Michael McCandless
Hmm, unless the ulimits are low, or the default mergeFactor was
changed, or you have many indexes open in a single JVM, or you keep
too many IndexReaders open, even in an NRT or frequent commit use
case, you should not run out of file descriptors.

Frequent commit/reopen should be perfectly fine, as long as you close
the old readers...

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 2, 2012 at 8:35 AM, Erick Erickson erickerick...@gmail.com wrote:
 How often are you committing index updates? This kind of thing
 can happen if you commit too often. Consider setting
 commitWithin to something like, say, 5 minutes. Or doing the
 equivalent with the autoCommit parameters in solrconfig.xml

 If that isn't relevant, you need to provide some more details
 about what you're doing and how you're using Solr

 Best
 Erick

 On Sun, Apr 1, 2012 at 10:47 PM, Gopal Patwa gopalpa...@gmail.com wrote:
  I am using a Solr 4.0 nightly build with NRT and I often get this
  error during auto commit: "Too many open files". I have searched this forum
  and what I found is that it is related to the OS ulimit setting; please see my
  ulimit settings below. I am not sure what ulimit setting I should have for open
  files. ulimit -n unlimited?

  Even if I set it to a higher number, it will just delay the issue until it reaches
  the new open file limit. What I have seen is that Solr keeps deleted index files
  open in the Java process, which causes an issue for our application server (JBoss)
  shutting down gracefully, due to these open files held by the Java process.

  I have seen that this issue was recently resolved in Lucene; is that true?

 https://issues.apache.org/jira/browse/LUCENE-3855


 I have 3 core with index size : core1 - 70GB, Core2 - 50GB and Core3
 - 15GB, with Single shard

 We update the index every 5 seconds, soft commit every 1 second and hard
 commit every 15 minutes

 Environment: Jboss 4.2, JDK 1.6 64 bit, CentOS , JVM Heap Size = 24GB*


 ulimit:

 core file size          (blocks, -c) 0

 data seg size           (kbytes, -d) unlimited

 scheduling priority             (-e) 0

 file size               (blocks, -f) unlimited

 pending signals                 (-i) 401408

 max locked memory       (kbytes, -l) 1024

 max memory size         (kbytes, -m) unlimited

 open files                      (-n) 4096

 pipe size            (512 bytes, -p) 8

 POSIX message queues     (bytes, -q) 819200

 real-time priority              (-r) 0

 stack size              (kbytes, -s) 10240

 cpu time               (seconds, -t) unlimited

 max user processes              (-u) 401408

 virtual memory          (kbytes, -v) unlimited

 file locks                      (-x) unlimited


 ERROR:*

 *2012-04-01* *20:08:35*,*323* [] *priority=ERROR* *app_name=*
 *thread=pool-10-thread-1* *location=CommitTracker* *line=93* *auto*
 *commit* *error...:org.apache.solr.common.SolrException:* *Error*
 *opening* *new* *searcher*
        *at* 
 *org.apache.solr.core.SolrCore.openNewSearcher*(*SolrCore.java:1138*)
        *at* *org.apache.solr.core.SolrCore.getSearcher*(*SolrCore.java:1251*)
        *at* 
 *org.apache.solr.update.DirectUpdateHandler2.commit*(*DirectUpdateHandler2.java:409*)
        *at* 
 *org.apache.solr.update.CommitTracker.run*(*CommitTracker.java:197*)
        *at* 
 *java.util.concurrent.Executors$RunnableAdapter.call*(*Executors.java:441*)
        *at* 
 *java.util.concurrent.FutureTask$Sync.innerRun*(*FutureTask.java:303*)
        *at* *java.util.concurrent.FutureTask.run*(*FutureTask.java:138*)
        *at* 
 *java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301*(*ScheduledThreadPoolExecutor.java:98*)
        *at* 
 *java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run*(*ScheduledThreadPoolExecutor.java:206*)
        *at* 
 *java.util.concurrent.ThreadPoolExecutor$Worker.runTask*(*ThreadPoolExecutor.java:886*)
        *at* 
 *java.util.concurrent.ThreadPoolExecutor$Worker.run*(*ThreadPoolExecutor.java:908*)
        *at* *java.lang.Thread.run*(*Thread.java:662*)*Caused* *by:*
 *java.io.FileNotFoundException:*
 */opt/mci/data/srwp01mci001/inventory/index/_4q1y_0.tip* (*Too many
 open files*)
        *at* *java.io.RandomAccessFile.open*(*Native* *Method*)
        *at* *java.io.RandomAccessFile.**init*(*RandomAccessFile.java:212*)
        *at* 
 *org.apache.lucene.store.FSDirectory$FSIndexOutput.**init*(*FSDirectory.java:449*)
        *at* 
 *org.apache.lucene.store.FSDirectory.createOutput*(*FSDirectory.java:288*)
        *at* 
 *org.apache.lucene.codecs.BlockTreeTermsWriter.**init*(*BlockTreeTermsWriter.java:161*)
        *at* 
 *org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsConsumer*(*Lucene40PostingsFormat.java:66*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField*(*PerFieldPostingsFormat.java:118*)
        *at* 
 *org.apache.lucene.index.FreqProxTermsWriterPerField.flush*(*FreqProxTermsWriterPerField.java:322*)
        *at* 
 

Re: Large Index and OutOfMemoryError: Map failed

2012-03-31 Thread Michael McCandless
It's the virtual memory limit that matters; yours says unlimited below
(good!), but, are you certain that's really the limit your Solr
process runs with?

On Linux, there is also a per-process map count:

cat /proc/sys/vm/max_map_count

I think it typically defaults to 65,536 but you should check on your
env.  If a process tries to map more than this many regions, you'll
hit that exception.

I think you can:

  cat /proc/pid/maps | wc

to see how many maps your Solr process currently has... if that is
anywhere near the limit then it could be the cause.

Mike McCandless

http://blog.mikemccandless.com

On Sat, Mar 31, 2012 at 1:26 AM, Gopal Patwa gopalpa...@gmail.com wrote:
 I need help!!

 I am using a Solr 4.0 nightly build with NRT and I often get this error
 during auto commit: java.lang.OutOfMemoryError: Map failed. I have
 searched this forum and what I found is that it is related to the OS ulimit
 settings; please see my ulimit settings below. I am not sure what ulimit
 settings I should have. We also get java.net.SocketException: Too many
 open files, and I am not sure how many open files we need to allow.

 I have 3 cores with index sizes: core1 - 70GB, core2 - 50GB and core3 -
 15GB, with a single shard.

 We update the index every 5 seconds, soft commit every 1 second and
 hard commit every 15 minutes.

 Environment: JBoss 4.2, JDK 1.6, CentOS, JVM heap size = 24GB

 ulimit:

 core file size          (blocks, -c) 0
 data seg size           (kbytes, -d) unlimited
 scheduling priority             (-e) 0
 file size               (blocks, -f) unlimited
 pending signals                 (-i) 401408
 max locked memory       (kbytes, -l) 1024
 max memory size         (kbytes, -m) unlimited
 open files                      (-n) 1024
 pipe size            (512 bytes, -p) 8
 POSIX message queues     (bytes, -q) 819200
 real-time priority              (-r) 0
 stack size              (kbytes, -s) 10240
 cpu time               (seconds, -t) unlimited
 max user processes              (-u) 401408
 virtual memory          (kbytes, -v) unlimited
 file locks                      (-x) unlimited


 *
 *

 *ERROR:*

 *
 *

 *2012-03-29* *15:14:08*,*560* [] *priority=ERROR* *app_name=*
 *thread=pool-3-thread-1* *location=CommitTracker* *line=93* *auto*
 *commit* *error...:java.io.IOException:* *Map* *failed*
        *at* *sun.nio.ch.FileChannelImpl.map*(*FileChannelImpl.java:748*)
        *at* 
 *org.apache.lucene.store.MMapDirectory$MMapIndexInput.**init*(*MMapDirectory.java:293*)
        *at* 
 *org.apache.lucene.store.MMapDirectory.openInput*(*MMapDirectory.java:221*)
        *at* 
 *org.apache.lucene.codecs.lucene40.Lucene40PostingsReader.**init*(*Lucene40PostingsReader.java:58*)
        *at* 
 *org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsProducer*(*Lucene40PostingsFormat.java:80*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.visitOneFormat*(*PerFieldPostingsFormat.java:189*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$VisitPerFieldFile.**init*(*PerFieldPostingsFormat.java:280*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader$1.**init*(*PerFieldPostingsFormat.java:186*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.**init*(*PerFieldPostingsFormat.java:186*)
        *at* 
 *org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer*(*PerFieldPostingsFormat.java:256*)
        *at* 
 *org.apache.lucene.index.SegmentCoreReaders.**init*(*SegmentCoreReaders.java:108*)
        *at* 
 *org.apache.lucene.index.SegmentReader.**init*(*SegmentReader.java:51*)
        *at* 
 *org.apache.lucene.index.IndexWriter$ReadersAndLiveDocs.getReader*(*IndexWriter.java:494*)
        *at* 
 *org.apache.lucene.index.BufferedDeletesStream.applyDeletes*(*BufferedDeletesStream.java:214*)
        *at* 
 *org.apache.lucene.index.IndexWriter.applyAllDeletes*(*IndexWriter.java:2939*)
        *at* 
 *org.apache.lucene.index.IndexWriter.maybeApplyDeletes*(*IndexWriter.java:2930*)
        *at* 
 *org.apache.lucene.index.IndexWriter.prepareCommit*(*IndexWriter.java:2681*)
        *at* 
 *org.apache.lucene.index.IndexWriter.commitInternal*(*IndexWriter.java:2804*)
        *at* 
 *org.apache.lucene.index.IndexWriter.commit*(*IndexWriter.java:2786*)
        *at* 
 *org.apache.solr.update.DirectUpdateHandler2.commit*(*DirectUpdateHandler2.java:391*)
        *at* 
 *org.apache.solr.update.CommitTracker.run*(*CommitTracker.java:197*)
        *at* 
 *java.util.concurrent.Executors$RunnableAdapter.call*(*Executors.java:441*)
        *at* 
 *java.util.concurrent.FutureTask$Sync.innerRun*(*FutureTask.java:303*)
        *at* *java.util.concurrent.FutureTask.run*(*FutureTask.java:138*)
        *at* 
 *java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301*(*ScheduledThreadPoolExecutor.java:98*)
        *at* 
 

Re: effect of continuous deletes on index's read performance

2012-02-06 Thread Michael McCandless
On Mon, Feb 6, 2012 at 8:20 AM, prasenjit mukherjee
prasen@gmail.com wrote:

 Pardon my ignorance: why can't the IndexWriter and IndexSearcher share
 the same underlying in-memory data structure, so that the IndexSearcher need
 not be reopened with every commit?

Because the semantics of an IndexReader in Lucene guarantee an
unchanging point-in-time view of the index, as of when that
IndexReader was opened.

That said, Lucene has near-real-time readers, which keep point-in-time
semantics but are very fast to open after adding/deleting docs, and do
not require a (costly) commit.  EG see my blog post:


http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html

The tests I ran there indexed at a highish rate (~1000 1KB sized docs
per second, or 1 MB plain text per second, or ~2X Twitter's peak rate,
at least as of last July), and the reopen latency was fast (~ 60
msec).  Admittedly this was a fast machine, and the index was on a
good SSD, and I used NRTCachingDir and MemoryCodec for the id field.

But net/net Lucene's NRT search is very fast.  It should easily handle
your 20 docs/second rate, unless your docs are enormous
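The reopen loop itself is tiny; a minimal sketch against the Lucene 3.5-era
API (dir, iwc and doc are the usual indexing setup, and exact method
signatures vary a bit between releases):

    IndexWriter writer = new IndexWriter(dir, iwc);
    IndexReader reader = IndexReader.open(writer, true);   // NRT reader, no commit required

    // indexing thread:
    writer.addDocument(doc);

    // reopen thread, e.g. a few times per second:
    IndexReader newReader = IndexReader.openIfChanged(reader, writer, true);
    if (newReader != null) {
      reader.close();
      reader = newReader;
    }
    IndexSearcher searcher = new IndexSearcher(reader);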

Solr trunk has finally cutover to using these APIs, but unfortunately
this has not been backported to Solr 3.x.  You might want to check out
ElasticSearch, an alternative to Solr, which does use Lucene's NRT
APIs

Mike McCandless

http://blog.mikemccandless.com


Re: LUCENE-995 in 3.x

2012-01-05 Thread Michael McCandless
Thank you Ingo!

I think post the 3.x patch directly on the issue?

I'm not sure why this wasn't backported to 3.x the first time around...

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jan 5, 2012 at 8:15 AM, Ingo Renner i...@typo3.org wrote:
 Hi all,

 I've backported LUCENE-995 to 3.x and the unit test for TestQueryParser is 
 green.

 What would be the workflow to actually get it into 3.x now?
 - attach the patch to the original issue or
 - create a new issue attaching the patch there?


 best
 Ingo

 --
 Ingo Renner
 TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

 TYPO3
 Open Source Enterprise Content Management System
 http://typo3.org










Re: LUCENE-995 in 3.x

2012-01-05 Thread Michael McCandless
Awesome, thanks Ingo... I'll have a look!

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jan 5, 2012 at 9:23 AM, Ingo Renner i...@typo3.org wrote:

 Am 05.01.2012 um 15:05 schrieb Michael McCandless:

 Thank you Ingo!

 I think post the 3.x patch directly on the issue?

  Thanks for the advice, Michael; the patch is attached:
 https://issues.apache.org/jira/browse/LUCENE-995


 Ingo

 --
 Ingo Renner
 TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

 TYPO3
 Open Source Enterprise Content Management System
 http://typo3.org










Re: help no segment in my lucene index!!!

2011-11-28 Thread Michael McCandless
Which version of Solr/Lucene were you using when you hit power loss?

There was a known bug that could allow power loss to cause corruption,
but this was fixed in Lucene 3.4.0.

Unfortunately, there is no easy way to recreate the segments_N file...
in principle it should be possible and maybe not too much work but
nobody has created such a tool yet, that I know of.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Nov 28, 2011 at 5:54 AM, Roberto Iannone
roberto.iann...@gmail.com wrote:
 Hi all,

 after a power supply interruption my Lucene index (about 28 GB) looks like
 this:

 18/11/2011  20:29     2.016.961.997 _3d.fdt
 18/11/2011  20:29         1.816.004 _3d.fdx
 18/11/2011  20:29                89 _3d.fnm
 18/11/2011  20:30       197.323.436 _3d.frq
 18/11/2011  20:30         1.816.004 _3d.nrm
 18/11/2011  20:30       358.016.461 _3d.prx
 18/11/2011  20:30           637.604 _3d.tii
 18/11/2011  20:30        48.565.519 _3d.tis
 18/11/2011  20:31           454.004 _3d.tvd
 18/11/2011  20:31     1.695.380.935 _3d.tvf
 18/11/2011  20:31         3.632.004 _3d.tvx
 18/11/2011  23:33     2.048.500.822 _6g.fdt
 18/11/2011  23:33         3.032.004 _6g.fdx
 18/11/2011  23:33                89 _6g.fnm
 18/11/2011  23:34       221.593.644 _6g.frq
 18/11/2011  23:34         3.032.004 _6g.nrm
 18/11/2011  23:34       350.136.996 _6g.prx
 18/11/2011  23:34           683.668 _6g.tii
 18/11/2011  23:34        52.224.328 _6g.tis
 18/11/2011  23:36           758.004 _6g.tvd
 18/11/2011  23:36     1.758.786.158 _6g.tvf
 18/11/2011  23:36         6.064.004 _6g.tvx
 19/11/2011  03:29     1.966.167.843 _9j.fdt
 19/11/2011  03:29         3.832.004 _9j.fdx
 19/11/2011  03:28                89 _9j.fnm
 19/11/2011  03:30       222.733.606 _9j.frq
 19/11/2011  03:30         3.832.004 _9j.nrm
 19/11/2011  03:30       324.722.843 _9j.prx
 19/11/2011  03:30           715.441 _9j.tii
 19/11/2011  03:30        54.488.546 _9j.tis
 

 without any segment files!
  I tried to fix it with the CheckIndex utility in Lucene, but I got the following
 message:

 ERROR: could not read any segments file in directory
 org.apache.lucene.index.IndexNotFoundException: no segments* file found in
 org.a
 pache.lucene.store.MMapDirectory@E:\recover_me
 lockFactory=org.apache.lucene.sto
 re.NativeFSLockFactory@5d36d1d7: files: [_3d.fdt, _3d.fdx, _3d.fnm,
 _3d.frq, _3d
 .nrm, _3d.prx, _3d.tii, _3d.tis, _3d.tvd, _3d.tvf, _3d.tvx, _6g.fdt,
 _6g.fdx, _6
 g.fnm, _6g.frq, _6g.nrm, _6g.prx, _6g.tii, _6g.tis, _6g.tvd, _6g.tvf,
 _6g.tvx, _
 9j.fdt, _9j.fdx, _9j.fnm, _9j.frq, _9j.nrm, _9j.prx, _9j.tii, _9j.tis,
 _9j.tvd,
 _9j.tvf, _9j.tvx, _cf.cfs, _cm.fdt, _cm.fdx, _cm.fnm, _cm.frq, _cm.nrm,
 _cm.prx,
  _cm.tii, _cm.tis, _cm.tvd, _cm.tvf, _cm.tvx, _ff.fdt, _ff.fdx, _ff.fnm,
 _ff.frq
 , _ff.nrm, _ff.prx, _ff.tii, _ff.tis, _ff.tvd, _ff.tvf, _ff.tvx, _ii.fdt,
 _ii.fd
 x, _ii.fnm, _ii.frq, _ii.nrm, _ii.prx, _ii.tii, _ii.tis, _ii.tvd, _ii.tvf,
 _ii.t
 vx, _lc.cfs, _ll.fdt, _ll.fdx, _ll.fnm, _ll.frq, _ll.nrm, _ll.prx, _ll.tii,
 _ll.
 tis, _ll.tvd, _ll.tvf, _ll.tvx, _lo.cfs, _lp.cfs, _lq.cfs, _lr.cfs,
 _ls.cfs, _lt
 .cfs, _lu.cfs, _lv.cfs, _lw.fdt, _lw.fdx, _lw.tvd, _lw.tvf, _lw.tvx,
 _m.fdt, _m.
 fdx, _m.fnm, _m.frq, _m.nrm, _m.prx, _m.tii, _m.tis, _m.tvd, _m.tvf, _m.tvx]
        at
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfo
 s.java:712)
        at
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfo
 s.java:593)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
        at
 org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:327)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:995)

  Is there a way to recover this index?

 Cheers

 Rob



Re: help no segment in my lucene index!!!

2011-11-28 Thread Michael McCandless
On Mon, Nov 28, 2011 at 10:49 AM, Roberto Iannone
iann...@crmpa.unisa.it wrote:
 Hi Michael,

 thx for your help :)

You're welcome!

 2011/11/28 Michael McCandless luc...@mikemccandless.com

 Which version of Solr/Lucene were you using when you hit power loss?

 I'm using Lucene 3.4.

Hmm, which OS/filesystem?  Unexpected power loss (or OS crash, or JVM
crash) in 3.4.0 should not cause corruption, as long as the IO system
properly implements fsync.

 There was a known bug that could allow power loss to cause corruption,
 but this was fixed in Lucene 3.4.0.

 Unfortunately, there is no easy way to recreate the segments_N file...
 in principle it should be possible and maybe not too much work but
 nobody has created such a tool yet, that I know of.

 some hints about how could I write this code by myself ?

Well, you'd need to take a listing of all files, aggregate those into
unique segment names, open a SegmentReader on each segment name, and
from that SegmentReader reconstruct what you can (numDocs, delCount,
isCompoundFile, etc.) about each SegmentInfo.  Add all the resulting
SegmentInfo instances into a new SegmentInfos and write it to the
directory.

Was the index newly created in 3.4.x?  If not (if you inherited
segments from earlier Lucene versions) you might also have to
reconstruct shared doc stores (stored fields, term vectors) files,
which will be trickier...

Mike


Re: Parent-child options

2011-11-08 Thread Michael McCandless
Lucene itself has BlockJoinQuery/Collector (in contrib/join), which is
what ElasticSearch is using under the hood for its nested documents (I
think?).

But I don't think this has been exposed in Solr yet... patches welcome!
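If you are willing to call Lucene directly in the meantime, the usage is
roughly this (a sketch from memory of the 3.4 join module; writer/searcher
setup is omitted, the child field is hypothetical, and the class was later
renamed ToParentBlockJoinQuery in 4.0, with ScoreMode moving around between
releases):

    // index each parent together with its children as one contiguous block
    List<Document> block = new ArrayList<Document>();
    block.add(childDoc1);
    block.add(childDoc2);
    Document parent = new Document();
    parent.add(new Field("docType", "parent", Field.Store.NO, Field.Index.NOT_ANALYZED));
    block.add(parent);
    writer.addDocuments(block);   // the whole block must be added atomically

    // at search time, join child matches up to their parent documents
    Filter parentsFilter = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("docType", "parent"))));
    Query childQuery = new TermQuery(new Term("skill", "java"));   // hypothetical child field
    BlockJoinQuery joinQuery = new BlockJoinQuery(childQuery, parentsFilter,
        BlockJoinQuery.ScoreMode.Avg);
    TopDocs hits = searcher.search(joinQuery, 10);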

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 8, 2011 at 12:59 PM, Jean Maynier jmayn...@eco2market.com wrote:
 Hello,

  Did someone find a way to solve the parent-child problem? The Join option
  is too complex because you have to create multiple document types and do the
  join in the query.

 ElasticSearch did a better job at solving this problem:
 http://www.elasticsearch.org/guide/reference/mapping/nested-type.html
 http://www.elasticsearch.org/guide/reference/query-dsl/nested-query.html

  Does Solr have a similar feature (at least on the roadmap)? I don't want to
  change to ES (too much change), but for the moment it seems better for
  structured content.

 --
 Jean Maynier



Re: large scale indexing issues / single threaded bottleneck

2011-10-29 Thread Michael McCandless
On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:

 one more thing, after somebody (thanks robert) pointed me at the
 stacktrace it seems kind of obvious what the root cause of your
  problem is. It's Solr :) Solr closes the IndexWriter on commit, which is
 very wasteful since you basically wait until all merges are done. Solr
 trunk has solved this problem.

That is very wasteful but I don't think it's actually the cause of the
slowdown here?

The cause looks like it's in applying deletes, which even once Solr
stops closing the IW will still occur (ie, IW.commit must also resolve
all deletes).

When IW resolves deletes, it 1) opens a SegmentReader for each segment
in the index, and 2) looks up each deleted term and marks its
document(s) as deleted.

I saw a mention somewhere that you can tell Solr to use
IW.addDocument (not IW.updateDocument) when you add a document if you
are certain it's not replacing a previous document with the same ID --
I don't know how to do that but if that's true, and you are truly only
adding documents, that could be the easiest fix here.

Failing that... you could try increasing
IndexWriterConfig.setReaderTermsIndexDivisor (not sure if/how this is
exposed in Solr's config)... this will make init time faster and RAM usage
lower for each SegmentReader, but term lookups slower; whether this
helps depends on whether your slowness is in opening the SegmentReader (how
long does it take to IR.open on your index?) or in resolving the
deletes once the SR is open.
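In raw Lucene terms the two knobs above look like this (a sketch; writer,
analyzer, doc and id are the usual setup, and how Solr exposes either is a
separate question):

    // pure add: nothing to resolve when the buffered deletes are applied at commit/reopen
    writer.addDocument(doc);

    // add-or-replace: also buffers a delete on the id term, and resolving those
    // buffered deletes is the costly step described above
    writer.updateDocument(new Term("id", id), doc);

    // terms-index divisor: cheaper/leaner SegmentReader opens, slower term lookups
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
    iwc.setReaderTermsIndexDivisor(4);   // 4 is only an example; the default is 1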

Do you have a great many terms in your index?  Can you run CheckIndex
and post the output?  (If so this might mean you have an analysis
problem, ie, putting too many terms in the index).

 We should maybe try to fix this in 3.x too?

+1; having to wait for running merges to complete when the app calls
commit is crazy (Lucene long ago removed that limitation).

Mike McCandless

http://blog.mikemccandless.com


Re: How to make UnInvertedField faster?

2011-10-22 Thread Michael McCandless
On Sat, Oct 22, 2011 at 4:10 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Fri, Oct 21, 2011 at 4:37 PM, Michael McCandless
 luc...@mikemccandless.com wrote:
 Well... the limitation of DocValues is that it cannot handle more than
 one value per document (which UnInvertedField can).

 you can pack this into one byte[] or use more than one field? I don't
 see a real limitation here.

Well... not very easily?

UnInvertedField (DocTermOrds in Lucene) is the same as DocValues'
BYTES_VAR_SORTED.

So for an app to do this on top it'd have to handle the term -> ord
resolving itself, save that somewhere, then encode the multiple ords
into a byte[].

I agree for other simple types (no deref/sorting involved) an app
could pack them into its own byte[] that's otherwise opaque to Lucene.

Mike McCandless

http://blog.mikemccandless.com


Re: How to make UnInvertedField faster?

2011-10-21 Thread Michael McCandless
Well... the limitation of DocValues is that it cannot handle more than
one value per document (which UnInvertedField can).

Hopefully we can fix that at some point :)

Mike McCandless

http://blog.mikemccandless.com

On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 In trunk we have a feature called IndexDocValues which basically
 creates the uninverted structure at index time. You can then simply
 suck that into memory or even access it on disk directly
 (RandomAccess). Even if I can't help you right now this is certainly
 going to help you here. There is no need to uninvert at all anymore in
 lucene 4.0

 simon

 On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan mr...@moreover.com wrote:
 I was wondering if anyone has any ideas for making UnInvertedField.uninvert()
 faster, or other alternatives for generating facets quickly.

 The vast majority of the CPU time for our Solr instances is spent generating
 UnInvertedFields after each commit. Here's an example of one of our slower 
 fields:

 [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
 UnInverted multi-valued field 
 {field=authorCS,memSize=38063628,tindexSize=422652,
 time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}

 That is from an index with approximately 8 million documents. After each 
 commit,
 it takes on average about 90 seconds to uninvert all the fields that we 
 facet on.

 Any ideas at all would be greatly appreciated.

 -Michael




Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Can you attach this PDF to an email and send it to the list?  Or is it too
large for that?

Or, you can try running Tika directly on the PDF to see if it's able
to extract the text.

Mike McCandless

http://blog.mikemccandless.com

2011/10/5 Héctor Trujillo hecto...@gmail.com:
  Sorry, you are right: this file was indexed with a .NET web service
  client that calls a Java application (a web service), which calls Solr using
  SolrJ.

  I will try to index this in a different way; maybe that will resolve the
  problem.

 Thanks

 Best regards



 On 5 October 2011 at 08:42, Héctor Trujillo hecto...@gmail.com wrote:

    It seems unreasonable that if I want to index a local file, I have to
  reference this local file by a URL.

  This isn't a strange file; it is a file downloaded from the Lucid web portal,
  called: Starting a Search Application.pdf

  This problem may be an encoding problem or a character set problem. I open
  this file with a PDF reader and I have no problems, and I don't know why
  referencing this file with a URL would fix this problem. Can you help me?

  I'm working with SolrJ, from Java. Does someone have the same problem with
  SolrJ?



 Thanks to Paul Libbrecht, for your option.



 Best regards






 2011/10/4 Paul Libbrecht p...@hoplahup.net

 full of boxes for me.
 Héctor, you need another way to reference these!
 (e.g. a URL)

 paul


 On 4 Oct. 2011 at 16:49, Héctor Trujillo wrote:

  Hi all, I'm indexing PDF files with SolrJ, and most of them work. But
 with some files I've got problems because they store strange characters.
  This is the content that got stored:
  +++
 
  Starting a Search Application
 
 
  Abstract
 
 Starting
  a Search Application A Lucid Imagination White Paper ¥ April 2009 Page
 i
 
 
  Starting a Search Application A Lucid Imagination White Paper ¥ April
 2009
  Page ii Do You Need Full-text Search?
 
 ∞
 
 ∞
  ∞
 
 

Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Hmm, no attachment; maybe it's too large?

Can you send it directly to me?

Mike McCandless

http://blog.mikemccandless.com

2011/10/5 Héctor Trujillo hecto...@gmail.com:
 This is the file that give me errors.

 2011/10/5 Michael McCandless luc...@mikemccandless.com

  Can you attach this PDF to an email and send it to the list?  Or is it too
 large for that?

 Or, you can try running Tika directly on the PDF to see if it's able
 to extract the text.

 Mike McCandless

 http://blog.mikemccandless.com

 2011/10/5 Héctor Trujillo hecto...@gmail.com:
  Sorry you have the reason, this file was indexed with a .Net web service
  client, that calls a Java application(a web service) that calls Solr
  using
  SolrJ.
 
  I will try to index this in a different way, may be this resolve the
  problem.
 
  Thanks
 
  Best regards
 
 
 
   On 5 October 2011 at 08:42, Héctor Trujillo
   hecto...@gmail.com wrote:
 
    It seems unreasonable that if I want to index a local file, I have to
  references this local file by an URL.
 
  This isn't a estrange file, this is a file downloaded from lucid web
  portal
  called: Starting a Search Application.pdf
 
  This problem may be a codification problem, or char set problem. I open
  this file with a PDF Reader and I have no problems, and I don’t Know
  why
  referencing this file with and URL will fix this problem, can you help
  me?
 
  I'm working with SolrJ, from Java, does some have the same problem with
  SolrJ?
 
 
 
  Thanks to Paul Libbrecht, for your option.
 
 
 
  Best regards
 
 
 
 
 
 
  2011/10/4 Paul Libbrecht p...@hoplahup.net
 
  full of boxes for me.
  Héctor, you need another way to reference these!
  (e.g. a URL)
 
  paul
 
 
  On 4 Oct. 2011 at 16:49, Héctor Trujillo wrote:
 
   Hi all, I'm indexing pdf's files with SolrJ, and most of them work.
   But
  with
   some files I’ve got problems because they stored estrange
   characters. I
  got
   stored this content:
   +++
  
   Starting a Search Application
  
 
  
   Abstract
  
 
  Starting
   a Search Application A Lucid Imagination White Paper ¥ April 2009
   Page
  i
  
 
  
   Starting a Search Application A Lucid Imagination White Paper ¥
   April
  2009
   Page ii Do You Need Full-text Search

Re: Query failing because of omitTermFreqAndPositions

2011-10-04 Thread Michael McCandless
This is because, within one segment only 1 value (omitP or not) is
possible, for all the docs in that segment.

This then means, on merging segments with different values for omitP,
Lucene must reconcile the different values, and that reconciliation
will favor omitting positions (if it went the other way, Lucene would
have to make up fake positions, which seems very dangerous).

Even if you delete all documents containing that field, and optimize
down to one segment, this omitPositions bit will still stick,
because of how Lucene stores the metadata per field.

omitNorms also behaves this way: once omitted, always omitted.
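At the Lucene level the flag in question is per-field and set at indexing
time; a sketch of the 3.x API (writer and text are the usual setup; in 4.0
this moves to IndexOptions on the field):

    Document doc = new Document();
    Field f = new Field("textForQuery", text, Field.Store.NO, Field.Index.ANALYZED);
    f.setOmitTermFreqAndPositions(true);   // once any segment is written this way,
                                           // merges keep omitting positions for the field
    doc.add(f);
    writer.addDocument(doc);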

Mike McCandless

http://blog.mikemccandless.com



On Tue, Oct 4, 2011 at 1:41 AM, Isan Fulia isan.fu...@germinait.com wrote:
 Hi Mike,

  Thanks for the information. But why is it that once positions have been omitted in the
  past, positions will always be omitted,
  even if omitPositions is set back to false?

 Thanks,
 Isan Fulia.

 On 29 September 2011 17:49, Michael McCandless 
 luc...@mikemccandless.comwrote:

 Once a given field has omitted positions in the past, even for just
 one document, it sticks and that field will forever omit positions.

 Try creating a new index, never omitting positions from that field?

 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, Sep 29, 2011 at 1:14 AM, Isan Fulia isan.fu...@germinait.com
 wrote:
  Hi All,
 
   My schema consisted of the field textForQuery, which was defined as
   <field name="textForQuery" type="text" indexed="true" stored="false"
   multiValued="true"/>
  
   After indexing 10 lakhs of documents, I changed the field to
   <field name="textForQuery" type="text" indexed="true" stored="false"
   multiValued="true" omitTermFreqAndPositions="true"/>
  
   So documents that were indexed after that omitted the position
   information of the terms.
   As a result I was not able to search text which relies on position
   information, e.g. "coke studio at mtv", even though it's present in some
   documents.
  
   So I again changed the field textForQuery to
   <field name="textForQuery" type="text" indexed="true" stored="false"
   multiValued="true"/>
  
   But now, even for newly added documents, queries requiring position
   information are still failing.
   For example, I reindexed certain documents that contain "coke studio
   at mtv", but the query still does not return any documents when searching for
   textForQuery:"coke studio at mtv"
 
  Can anyone please help me out why this is happening
 
 
  --
  Thanks  Regards,
  Isan Fulia.
 




 --
 Thanks  Regards,
 Isan Fulia.



Re: Query failing because of omitTermFreqAndPositions

2011-09-29 Thread Michael McCandless
Once a given field has omitted positions in the past, even for just
one document, it sticks and that field will forever omit positions.

Try creating a new index, never omitting positions from that field?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 29, 2011 at 1:14 AM, Isan Fulia isan.fu...@germinait.com wrote:
 Hi All,

 My schema consisted of the field textForQuery, which was defined as
 <field name="textForQuery" type="text" indexed="true" stored="false"
 multiValued="true"/>

 After indexing 10 lakhs of documents, I changed the field to
 <field name="textForQuery" type="text" indexed="true" stored="false"
 multiValued="true" omitTermFreqAndPositions="true"/>

 So documents that were indexed after that omitted the position information
 of the terms.
 As a result I was not able to search text which relies on position
 information, e.g. "coke studio at mtv", even though it's present in some
 documents.

 So I again changed the field textForQuery to
 <field name="textForQuery" type="text" indexed="true" stored="false"
 multiValued="true"/>

 But now, even for newly added documents, queries requiring position
 information are still failing.
 For example, I reindexed certain documents that contain "coke studio at
 mtv", but the query still does not return any documents when searching for
 textForQuery:"coke studio at mtv"

 Can anyone please help me out why this is happening


 --
 Thanks  Regards,
 Isan Fulia.



Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-22 Thread Michael McCandless
On Wed, Sep 21, 2011 at 10:10 PM, Michael Sokolov soko...@ifactory.com wrote:

 I wonder if config-file validation would be helpful here :) I posted a patch
 in SOLR-1758 once.

Big +1.

We should aim for as stringent config file checking as possible.

Mike McCandless

http://blog.mikemccandless.com


Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
Are you sure you are using a 64 bit JVM?

Are you sure you really changed your vmem limit to unlimited?  That
should have resolved the OOME from mmap.

Or: can you run cat /proc/sys/vm/max_map_count?  This is a limit on
the total number of maps in a single process, that Linux imposes.  But
the default limit is usually high (64K), so it'd be surprising if you
are hitting that unless it's lower in your env.

The amount of [free] RAM on the machine should have no bearing on
whether mmap succeeds or fails; it's the available address space (32
bit is tiny; 64 bit is immense) and then any OS limits imposed.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2011 at 5:27 AM, Ralf Matulat ralf.matu...@bundestag.de wrote:
 Good morning!
  Recently we slipped into an OOME by optimizing our index. It looks like it's
  related to the nio class and memory handling.
 I'll try to describe the environment, the error and what we did to solve the
 problem. Nevertheless, none of our approaches was successful.

 The environment:

 - Tested with both, SOLR 3.3  3.4
 - SuSE SLES 11 (X64)virtual machine with 16GB RAM
  - ulimit: virtual memory 14834560 (14GB)
 - Java: java-1_6_0-ibm-1.6.0-124.5
 - Apache Tomcat/6.0.29

 - Index Size (on filesystem): ~5GB, 1.1 million text documents.

 The error:
 First, building the index from scratch with a mysql DIH, with an empty
 index-Dir works fine.
 Building an index with command=full-import, when the old segment files
 still in place, fails with an OutOfMemoryException. Same as optimizing the
 index.
 Doing an optimize fails after some time with:

 SEVERE: java.io.IOException: background merge hit exception:
 _6p(3.4):Cv1150724 _70(3.4):Cv667 _73(3.4):Cv7 _72(3.4):Cv4 _71(3.4):Cv1
 into _74 [optimize]
        at
 org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2552)
        at
 org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2472)
        at
 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:410)
        at
 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
        at
 org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
        at
 org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
        at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:61)
        at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
        at
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:735)
 Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:765)
        at
 org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(MMapDirectory.java:264)
        at
 org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:216)
        at
 org.apache.lucene.index.SegmentCoreReaders.init(SegmentCoreReaders.java:89)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:115)
        at
 org.apache.lucene.index.IndexWriter$ReaderPool.get(IndexWriter.java:710)
        at
 org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4378)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3917)
        at
 org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
        at
 org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)
 Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:762)

Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
OK, excellent.  Thanks for bringing closure,

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2011 at 9:00 AM, Ralf Matulat ralf.matu...@bundestag.de wrote:
 Dear Mike,
 thanks for your your reply.
 Just a couple of minutes we found a solution or - to be honest - where we
 went wrong.
 Our failure was the use of ulimit. We missed, that ulimit sets the vmem for
 each shell seperatly. So we set 'ulimit -v unlimited' on a shell, thinking
 that we've done the job correctly.
 As we recognized our mistake, we added 'ulimit -v unlimited' to our
  init-Skript of the tomcat-instance and now it looks like everything works
 as expected.
 Need some further testing with the java versions, but I'm quite optimistic.
 Best regards
 Ralf

 On 22.09.2011 14:46, Michael McCandless wrote:

 Are you sure you are using a 64 bit JVM?

 Are you sure you really changed your vmem limit to unlimited?  That
 should have resolved the OOME from mmap.

 Or: can you run cat /proc/sys/vm/max_map_count?  This is a limit on
 the total number of maps in a single process, that Linux imposes.  But
 the default limit is usually high (64K), so it'd be surprising if you
 are hitting that unless it's lower in your env.

 The amount of [free] RAM on the machine should have no bearing on
 whether mmap succeeds or fails; it's the available address space (32
 bit is tiny; 64 bit is immense) and then any OS limits imposed.

 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, Sep 22, 2011 at 5:27 AM, Ralf Matulat ralf.matu...@bundestag.de
  wrote:

 Good morning!
 Recently we slipped into an OOME by optimizing our index. It looks like
 it's
 regarding to the nio class and the memory-handling.
 I'll try to describe the environment, the error and what we did to solve
 the
 problem. Nevertheless, none of our approaches was successful.

 The environment:

 - Tested with both, SOLR 3.3 & 3.4
 - SuSE SLES 11 (X64)virtual machine with 16GB RAM
 - ulimit: virtual memory 14834560 (14GB)
 - Java: java-1_6_0-ibm-1.6.0-124.5
 - Apache Tomcat/6.0.29

 - Index Size (on filesystem): ~5GB, 1.1 million text documents.

 The error:
 First, building the index from scratch with a mysql DIH, with an empty
 index-Dir works fine.
 Building an index with command=full-import, when the old segment files
 still in place, fails with an OutOfMemoryException. Same as optimizing
 the
 index.
 Doing an optimize fails after some time with:

 SEVERE: java.io.IOException: background merge hit exception:
 _6p(3.4):Cv1150724 _70(3.4):Cv667 _73(3.4):Cv7 _72(3.4):Cv4 _71(3.4):Cv1
 into _74 [optimize]
        at
 org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2552)
        at
 org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2472)
        at

 org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:410)
        at

 org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
        at

 org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:154)
        at

 org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:107)
        at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:61)
        at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at

 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at

 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
        at

 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:735)
 Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:765)
        at

 org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(MMapDirectory.java:264

Re: Optimize fails with OutOfMemory Exception - sun.nio.ch.FileChannelImpl.map involved

2011-09-22 Thread Michael McCandless
Unfortunately I really don't know ;)  Every time I set forth to figure
things like this out I seem to learn some new way...

Maybe someone else knows?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2011 at 2:15 PM, Shawn Heisey s...@elyograg.org wrote:
 Michael,

 What is the best central place on an rpm-based distro (CentOS 6 in my case)
 to raise the vmem limit for specific user(s), assuming it's not already
 correct?  I'm using /etc/security/limits.conf to raise the open file limit
 for the user that runs Solr:

 ncindex         hard    nofile  65535
 ncindex         soft    nofile  49151

 Thanks,
 Shawn


 On 9/22/2011 9:56 AM, Michael McCandless wrote:

 OK, excellent.  Thanks for bringing closure,

 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, Sep 22, 2011 at 9:00 AM, Ralf Matulat ralf.matu...@bundestag.de
  wrote:

 Dear Mike,
 thanks for your your reply.
 Just a couple of minutes we found a solution or - to be honest - where we
 went wrong.
 Our failure was the use of ulimit. We missed, that ulimit sets the vmem
 for
 each shell seperatly. So we set 'ulimit -v unlimited' on a shell,
 thinking
 that we've done the job correctly.
 As we recognized our mistake, we added 'ulimit -v unlimited' to our
  init-Skript of the tomcat-instance and now it looks like everything
 works
 as expected.





Re: MMapDirectory failed to map a 23G compound index segment

2011-09-20 Thread Michael McCandless
Since you hit OOME during mmap, I think this is an OS issue not a JVM
issue.  Ie, the JVM isn't running out of memory.

How many segments were in the unoptimized index?  It's possible the OS
rejected the mmap because of process limits.  Run cat
/proc/sys/vm/max_map_count to see how many mmaps are allowed.

Or: is it possible you reopened the reader several times against the
index (ie, after committing from Solr)?  If so, I think 2.9.x never
unmaps the mapped areas, and so this would accumulate against the
system limit.

 My memory of this is a little rusty but isn't mmap also limited by mem + swap 
 on the box? What does 'free -g' report?

I don't think this should be the case; you are using a 64 bit OS/JVM
so in theory (except for OS system wide / per-process limits imposed)
you should be able to mmap up to the full 64 bit address space.

Your virtual memory is unlimited (from ulimit output), so that's good.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens richcari...@gmail.com wrote:
 Ahoy ahoy!

 I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound
 index segment file. The stack trace looks pretty much like every other trace
 I've found when searching for OOM  map failed[1]. My configuration
 follows:

 Solr 1.4.1/Lucene 2.9.3 (plus
 SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
 )
 CentOS 4.9 (Final)
 Linux 2.6.9-100.ELsmp x86_64 yada yada yada
 Java SE (build 1.6.0_21-b06)
 Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
 ulimits:
    core file size     (blocks, -c)     0
    data seg size    (kbytes, -d)     unlimited
    file size     (blocks, -f)     unlimited
    pending signals    (-i)     1024
    max locked memory     (kbytes, -l)     32
    max memory size     (kbytes, -m)     unlimited
    open files    (-n)     256000
    pipe size     (512 bytes, -p)     8
    POSIX message queues     (bytes, -q)     819200
    stack size    (kbytes, -s)     10240
    cpu time    (seconds, -t)     unlimited
    max user processes     (-u)     1064959
    virtual memory    (kbytes, -v)     unlimited
    file locks    (-x)     unlimited

 Any suggestions?

 Thanks in advance,
 Rich

 [1]
 ...
 java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map(Unknown Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
 Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
 Source)
  at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
  at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown Source)

  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
 Source)
  at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
  at org.apache.lucene.index.IndexReader.open(Unknown Source)
 ...
 Caused by: java.lang.OutOfMemoryError: Map failed
  at sun.nio.ch.FileChannelImpl.map0(Native Method)
 ...



[ANNOUNCE] Apache Solr 3.4.0 released

2011-09-14 Thread Michael McCandless
September 14 2011, Apache Solr™ 3.4.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 3.4.0.

Apache Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
distributed search and index replication, and it powers the search and
navigation features of many of the world's largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:

   http://www.apache.org/dyn/closer.cgi/lucene/solr (see note below).

If you are already using Apache Solr 3.1, 3.2 or 3.3, we strongly
recommend you upgrade to 3.4.0 because of the index corruption bug on OS
or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.4.0 Release Highlights:

  * Bug fixes and improvements from Apache Lucene 3.4.0, including a
major bug (LUCENE-3418) whereby a Lucene index could
easily become corrupted if the OS or computer crashed or lost
power.

  * SolrJ client can now parse grouped and range facets results
(SOLR-2523).

  * A new XsltUpdateRequestHandler allows posting XML that's
transformed by a provided XSLT into a valid Solr document
(SOLR-2630).

  * Post-group faceting option (group.truncate) can now compute
facet counts for only the highest ranking documents per-group.
(SOLR-2665).

  * Add commitWithin update request parameter to all update handlers
that were previously missing it.  This tells Solr to commit the
change within the specified amount of time (SOLR-2540).

  * You can now specify NIOFSDirectory (SOLR-2670).

  * New parameter hl.phraseLimit speeds up FastVectorHighlighter
(LUCENE-3234).

  * The query cache and filter cache can now be disabled per request
See http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
(SOLR-2429).

  * Improved memory usage, build time, and performance of
SynonymFilterFactory (LUCENE-3233).

  * Added omitPositions to the schema, so you can omit position
information while still indexing term frequencies (LUCENE-2048).

  * Various fixes for multi-threaded DataImportHandler.

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Apache Lucene/Solr Developers


Re: Nested documents

2011-09-11 Thread Michael McCandless
Even if it applies, this is for Lucene.  I don't think we've added
Solr support for this yet... we should!

Mike McCandless

http://blog.mikemccandless.com

On Sun, Sep 11, 2011 at 12:16 PM, Erick Erickson
erickerick...@gmail.com wrote:
 Does this JIRA apply?

 https://issues.apache.org/jira/browse/LUCENE-3171

 Best
 Erick

 On Sat, Sep 10, 2011 at 8:32 PM, Andy angelf...@yahoo.com wrote:
 Hi,

 Does Solr support nested documents? If not is there any plan to add such a 
 feature?

 Thanks.



Re: What will happen when one thread is closing a searcher while another is searching?

2011-09-06 Thread Michael McCandless
Closing a searcher while thread(s) is/are still using it is definitely
bad, so, this code looks spooky...

But: is it possible something higher up (in Solr) is ensuring this code
runs exclusively?  I don't know enough about this part of Solr...
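
For reference, the usual way to make that kind of swap safe is reference
counting on the reader, roughly like SolrCore does for its main searcher.
A rough sketch with Lucene's IndexReader.incRef()/decRef() (currentReader
and spellIndex are just stand-ins here, and real code also needs a retry
loop in case the reader gets closed between reading the field and incRef):

  // Search thread: pin the reader before using it, release it when done.
  IndexReader r = currentReader;   // volatile field holding the "live" reader
  r.incRef();
  try {
    IndexSearcher s = new IndexSearcher(r);
    // ... run the spellcheck query against s ...
  } finally {
    r.decRef();                    // the reader only really closes once every user has released it
  }

  // Swap thread: publish the new reader first, then release the old one.
  IndexReader newReader = IndexReader.open(spellIndex);
  IndexReader old = currentReader;
  currentReader = newReader;
  old.decRef();                    // balances the ref from open(); closes once in-flight searches finish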

Mike McCandless

http://blog.mikemccandless.com

On Mon, Sep 5, 2011 at 10:43 PM, Li Li fancye...@gmail.com wrote:
 hi all,
     I am using spellcheck in solr 1.4. I found that spell check is not
 implemented as SolrCore. in SolrCore, it uses reference count to track
 current searcher. oldSearcher and newSearcher will both exist if oldSearcher
 is servicing some query. But in FileBasedSpellChecker

  public void build(SolrCore core, SolrIndexSearcher searcher) {
    try {
      loadExternalFileDictionary(core.getSchema(),
 core.getResourceLoader());
      spellChecker.clearIndex();
      spellChecker.indexDictionary(dictionary);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
  public void clearIndex() throws IOException {
    IndexWriter writer = new IndexWriter(spellIndex, null, true);
    writer.close();

    //close the old searcher
    searcher.close();
    searcher = new IndexSearcher(this.spellIndex);
  }

  it clear old Index and close current searcher. When other thread is doing
 search and searcher.close() is called, will it cause problem? Or
 searcher.close() has finished and new IndexSearch has not yet constructed.
 When other thread try to do search, will it also be problematic?



heads up: re-index trunk Lucene/Solr indices

2011-08-20 Thread Michael McCandless
Hi,

I just committed a new block tree terms dictionary implementation,
which requires fully re-indexing any trunk indices.

See here for details:

https://issues.apache.org/jira/browse/LUCENE-3030

If you are using a released version of Lucene/Solr then you can ignore
this message.

Mike McCandless

http://blog.mikemccandless.com


Re: Solr Join in 3.3.x

2011-08-18 Thread Michael McCandless
Unfortunately Solr's join impl hasn't been backported to 3.x, as far as I know.

You might want to look at ElasticSearch, which already has a join
implementation, or use Solr 4.0.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Aug 17, 2011 at 7:40 PM, Cameron Hurst wakemaste...@z33k.com wrote:
 Hello all,

 I was looking into finding a way to do filtering of documents based on
 fields of other documents in the index. In particular I have a document that
 will update very frequently and hundreds that will very rarely change, but
 the rarely changing documents have a field that will change often that is
 denormalized from the frequently changing document. The brute force method I
 have is to reindex all the documents every time that field changes, but this
 at times is a huge load on my server at a critical time that I am trying to
 avoid.

 To avoid this hit I was trying to implement patch SOLR-2272. This opens up a
 join feature to map fields of 1 document onto another (or so my
 understanding is). This would allow me to only update that 1 document and
 have the change applied to all others that rely on it. There is a number of
 spots that this patch fails to apply and I was wondering if anyone has tried
 to use join in 3.3 or any other released version of SOLR or if I the only
 way to do it is use 4.0.

 Also while I found this patch, I am also open to any other ideas that people
 have on how to accomplish what I need, this just seemed like the most direct
 method.

 Thanks for the help,

 Cameron



Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
This file is actually optional; it's there for redundancy in case the
filesystem is not reliable when listing a directory.  Ie, normally,
we list the directory to find the latest segments_N file; but if this
is wrong (eg the file system might have a stale cache) then we
fall back to reading the segments.gen file.

For example this is sometimes needed for NFS.

Likely replication is just skipping it?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:

 I have now updated to solr 3.3 but segment.gen is still not replicated.

 Any idea why, is it a bug or a feature?
 Should I write a jira issue for it?

 Regards
 Bernd

 On 29.07.2011 14:10, Bernd Fehling wrote:

 Dear list,

 is there a deeper logic behind why the segment.gen file is not
 replicated with solr 3.2?

 Is it obsolete because I have a single segment?

 Regards,
 Bernd




Re: segment.gen file is not replicated

2011-08-04 Thread Michael McCandless
I think we should fix replication to copy it?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2011 at 8:16 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:


  On 04.08.2011 12:52, Michael McCandless wrote:

 This file is actually optional; its there for redundancy in case the
 filesystem is not reliable when listing a directory.  Ie, normally,
 we list the directory to find the latest segments_N file; but if this
 is wrong (eg the file system might have stale a cache) then we
 fallback to reading the segments.gen file.

 For example this is sometimes needed for NFS.

 Likely replication is just skipping it?

 That was my first idea. If not changed and touched then it will be skipped.

 While being smart I deleted it on slave from index dir and then
 replicated, but segment.gen was not replicated.
 Due to your explanation NFS could not be reliable any more.

 So my idea either a bug or a feature and the experts will know :-)

 Regards
 Bernd


 Mike McCandless

 http://blog.mikemccandless.com

 On Thu, Aug 4, 2011 at 3:38 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de  wrote:

 I have now updated to solr 3.3 but segment.gen is still not replicated.

 Any idea why, is it a bug or a feature?
 Should I write a jira issue for it?

 Regards
 Bernd

 On 29.07.2011 14:10, Bernd Fehling wrote:

 Dear list,

 is there a deeper logic behind why the segment.gen file is not
 replicated with solr 3.2?

 Is it obsolete because I have a single segment?

 Regards,
 Bernd





Re: Field collapsing on multiple fields and/or ranges?

2011-07-06 Thread Michael McCandless
I believe the underlying grouping module is now technically able to do
this, because subclasses of the abstract first/second pass grouping
collectors are free to decide what type/value the group key is.

But, we have to fix Solr to allow for compound keys by creating the
necessary concrete subclasses.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 6:22 AM, Rih tanrihae...@gmail.com wrote:
 Have the same requirement. What is your workaround for this?


 On Thu, May 12, 2011 at 7:40 AM, arian487 akarb...@tagged.com wrote:

 I'm wondering if there is a way to get the field collapsing to collapse on
 multiple things?  For example, is there a way to get it to collapse on a
 field (lets say 'domain') but ALSO something else (maybe time or
 something)?

 To visualize maybe something like this:

 Group1 has common field 'www.forum1.com' and ALSO the posts are all from
 may
 11
 Group2 has common field 'www.forum2.com' and ALSO the posts are all from
 may
 11
 .
 .
 .
 GroupX has common field 'www.forum1.com' and ALSO the posts from may 12

 So obviously it's still sorted by date but it won't group the
 'www.forum1.com' things together if the document is from a different date,
 it'll group common date AND common domain field.

 Thanks!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Field-collapsing-on-multiple-fields-and-or-ranges-tp2929793p2929793.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Cannot I search documents added by IndexWriter after commit?

2011-07-05 Thread Michael McCandless
After your writer.commit you need to reopen your searcher to see the changes.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout
gabri...@mysimpatico.com wrote:
    @Test
    public void testUpdate() throws IOException,
 ParserConfigurationException, SAXException, ParseException {
        Analyzer analyzer = getAnalyzer();
        QueryParser parser = new QueryParser(Version.LUCENE_32, content,
 analyzer);
        Query allQ = parser.parse(*:*);

        IndexWriter writer = getWriter();
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(writer,
 true));
        TopDocs docs = searcher.search(allQ, 10);
 *        assertEquals(0, docs.totalHits); // empty/no index*

        Document doc = getDoc();
        writer.addDocument(doc);
        writer.commit();

        docs = searcher.search(allQ, 10);
 *        assertEquals(1,docs.totalHits); //it fails here. docs.totalHits
 equals 0*
    }
 What am I doing wrong here?

 If I initialize searcher with new IndexSearcher(directory) I'm told:
 org.apache.lucene.index.IndexNotFoundException: no segments* file found in
 org.apache.lucene.store.RAMDirectory@3caa4blockFactory=org.apache.lucene.store.SingleInstanceLockFactory@ed0220c:
 files: []

 --
 Regards,
 K. Gabriele

 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
  Now + 48h) ⇒ ¬resend(I, this).

 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).



Re: Cannot I search documents added by IndexWriter after commit?

2011-07-05 Thread Michael McCandless
Sorry, you must reopen the underlying IndexReader, and then make a new
IndexSearcher from the reopened reader.
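
A minimal sketch of that, assuming the 3.x reopen() API and the NRT reader
from your test:

  IndexReader oldReader = searcher.getIndexReader();
  IndexReader newReader = oldReader.reopen();   // new reader only if the index changed
  if (newReader != oldReader) {
    searcher.close();                           // does not close the reader it was built from
    oldReader.close();                          // release the stale reader
    searcher = new IndexSearcher(newReader);
  }
  docs = searcher.search(allQ, 10);
  assertEquals(1, docs.totalHits);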

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 5, 2011 at 2:12 PM, Gabriele Kahlout
gabri...@mysimpatico.com wrote:
 and how do you do that? There is no reopen method

 On Tue, Jul 5, 2011 at 8:09 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 After your writer.commit you need to reopen your searcher to see the
 changes.

 Mike McCandless

 http://blog.mikemccandless.com

 On Tue, Jul 5, 2011 at 1:48 PM, Gabriele Kahlout
 gabri...@mysimpatico.com wrote:
     @Test
     public void testUpdate() throws IOException,
  ParserConfigurationException, SAXException, ParseException {
         Analyzer analyzer = getAnalyzer();
         QueryParser parser = new QueryParser(Version.LUCENE_32, content,
  analyzer);
         Query allQ = parser.parse(*:*);
 
         IndexWriter writer = getWriter();
         IndexSearcher searcher = new
 IndexSearcher(IndexReader.open(writer,
  true));
         TopDocs docs = searcher.search(allQ, 10);
  *        assertEquals(0, docs.totalHits); // empty/no index*
 
         Document doc = getDoc();
         writer.addDocument(doc);
         writer.commit();
 
         docs = searcher.search(allQ, 10);
  *        assertEquals(1,docs.totalHits); //it fails here. docs.totalHits
  equals 0*
     }
  What am I doing wrong here?
 
  If I initialize searcher with new IndexSearcher(directory) I'm told:
  org.apache.lucene.index.IndexNotFoundException: no segments* file found
 in
  org.apache.lucene.store.RAMDirectory@3caa4blockFactory
 =org.apache.lucene.store.SingleInstanceLockFactory@ed0220c:
  files: []
 
  --
  Regards,
  K. Gabriele
 
  --- unchanged since 20/9/10 ---
  P.S. If the subject contains [LON] or the addressee acknowledges the
  receipt within 48 hours then I don't resend the email.
  subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
 time(x)
   Now + 48h) ⇒ ¬resend(I, this).
 
  If an email is sent by a sender that is not a trusted contact or the
 email
  does not contain a valid code then the email is not received. A valid
 code
  starts with a hyphen and ends with X.
  ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
  L(-[a-z]+[0-9]X)).
 




 --
 Regards,
 K. Gabriele

 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
  Now + 48h) ⇒ ¬resend(I, this).

 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).



Re: Fuzzy Query Param

2011-06-30 Thread Michael McCandless
Good question... I think in Lucene 4.0, the edit distance is (will be)
in Unicode code points, but in past releases, it's UTF16 code units.

Mike McCandless

http://blog.mikemccandless.com

2011/6/30 Floyd Wu floyd...@gmail.com:
 if this is edit distance implementation, what is the result apply to CJK
 query? For example, 您好~3

 Floyd


 2011/6/30 entdeveloper cameron.develo...@gmail.com

 I'm using Solr trunk.

 If it's levenstein/edit distance, that's great, that's what I want. It just
 didn't seem to be officially documented anywhere so I wanted to find out
 for
 sure. Thanks for confirming.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Fuzzy-Query-Param-tp3120235p3122418.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Fuzzy Query Param

2011-06-29 Thread Michael McCandless
Which version of Solr (Lucene) are you using?

Recent versions of Lucene now accept ~N > 1 to be edit distance.  Ie
foobar~2 matches any term that's <= 2 edit distance away from foobar.
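
For example (a sketch assuming trunk/4.0 APIs; the field name and analyzer
are placeholders, and parse() throws ParseException), the query parser turns
that syntax into a fuzzy query with max edit distance 2:

  QueryParser qp = new QueryParser(Version.LUCENE_40, "content", analyzer);
  Query q = qp.parse("foobar~2");   // fuzzy query: terms within 2 edits of "foobar"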

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 28, 2011 at 11:00 PM, entdeveloper
cameron.develo...@gmail.com wrote:
 According to the docs on lucene query syntax:

 Starting with Lucene 1.9 an additional (optional) parameter can specify the
 required similarity. The value is between 0 and 1, with a value closer to 1
 only terms with a higher similarity will be matched.

 I was messing around with this and started doing queries with values greater
 than 1 and it seemed to be doing something. However I haven't been able to
 find any documentation on this.

 What happens when specifying a fuzzy query with a value > 1?

 tiger~2
 animal~3

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Fuzzy-Query-Param-tp3120235p3120235.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Optimize taking two steps and extra disk space

2011-06-21 Thread Michael McCandless
OK that sounds like a good solution!

You can also have CMS limit how many merges are allowed to run at
once, if your IO system has trouble w/ that much concurrency.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 20, 2011 at 6:29 PM, Shawn Heisey s...@elyograg.org wrote:
 On 6/20/2011 3:18 PM, Michael McCandless wrote:

 With segmentsPerTier at 35 you will easily cross 70 segs in the index...
 If you want optimize to run in a single merge, I would lower
 segmentsPerTier and mergeAtOnce (maybe back to the 10 default), and set
 your maxMergeAtOnceExplicit to 70 or higher...

 Lower mergeAtOnce means merges run more frequently but for shorter
 time, and, your searching should be faster (than 35/35) since there
 are fewer segments to visit.

 Thanks again for more detailed information.  There is method to my madness,
 which I will now try to explain.

 With a value of 10, the reindex involves enough merges that there are
 many second-level merges, and a third-level merge.  I was running into
 situations on my development platform (with its slow disks) where there were
 three merges happening at the same time, which caused all indexing activity
 to cease for several minutes.  This in turn would cause JDBC to time out and
 drop the connection to the database, which caused DIH to fail and rollback
 the entire import about two hours (two thirds) in.

 With a mergeFactor of 35, there are no second level merges, and no
 third-level merges.  I can do a complete reindex successfully even on a
 system with slow disks.

 In production, one shard (out of six) is optimized every day to eliminate
 deleted documents.  When I have to reindex everything, I will typically go
 through and manually optimize each shard in turn after it's done.  This is
 the point where I discovered this two-pass problem.

 I don't want to do a full-import with optimize=true, because all six large
 shards build at the same time in a Xen environment.  The I/O storm that
 results from three optimizes happening on each host at the same time and
 then replicating to similar Xen hosts is very bad.

 I have now set maxMergeAtOnceExplicit to 105.  I think that is probably
 enough, given that that I currently do not experience any second level
 merges.  When my index gets big enough, I will increase the ram buffer.  By
 then I will probably have more memory, so the first-level merges can still
 happen entirely from I/O cache.

 Shawn




Re: Optimize taking two steps and extra disk space

2011-06-21 Thread Michael McCandless
On Tue, Jun 21, 2011 at 9:42 AM, Shawn Heisey s...@elyograg.org wrote:
 On 6/20/2011 12:31 PM, Michael McCandless wrote:

 For back-compat, mergeFactor maps to both of these, but it's better to
 set them directly eg:

     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
       <int name="maxMergeAtOnce">10</int>
       <int name="segmentsPerTier">20</int>
     </mergePolicy>

 (and then remove your mergeFactor setting under indexDefaults)

 When I did this and ran a reindex, it merged once it reached 10 segments,
 despite what I had defined in the mergePolicy.  This is Solr 3.2 with the
 patch from SOLR-1972 applied.  I've included the config snippet below into
 solrconfig.xml using xinclude via another file.  I had to put mergeFactor
 back in to make it work right.  I haven't checked yet to see whether an
 optimize takes one pass.  That will be later today.

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
 <int name="maxMergeAtOnce">35</int>
 <int name="segmentsPerTier">35</int>
 <int name="maxMergeAtOnceExplicit">105</int>
 </mergePolicy>

Hmm something strange is going on.

In Solr 3.2, if you attempt to use mergeFactor and useCompoundFile
inside indexDefaults (and outside the mergePolicy), when your
mergePolicy is TMP, you should see a warning like this:

  Use of compound file format or mergefactor cannot be configured if
merge policy is not an instance of LogMergePolicy. The configured
policy's defaults will be used.

And it shouldn't work.  But, using the right params inside your
mergePolicy section ought to work (though, I don't think this is well
tested...).  I'm not sure why you're seeing the opposite of what I'd
expect...

I wonder if you're actually really getting the TMP?  Can you turn on
verbose IndexWriter infoStream and post the output?

Mike McCandless

http://blog.mikemccandless.com


Re: Optimize taking two steps and extra disk space

2011-06-20 Thread Michael McCandless
On Sun, Jun 19, 2011 at 12:35 PM, Shawn Heisey s...@elyograg.org wrote:
 On 6/19/2011 7:32 AM, Michael McCandless wrote:

 With LogXMergePolicy (the default before 3.2), optimize respects
 mergeFactor, so it's doing 2 steps because you have 37 segments but 35
 mergeFactor.

 With TieredMergePolicy (default on 3.2 and after), there is now a
 separate merge factor used for optimize (maxMergeAtOnceExplicit)... so
 you could eg set this factor higher and more often get a single merge
 for the optimize.

 This makes sense.  the default for maxMergeAtOnceExplicit is 30 according to
 LUCENE-854, so it merges the first 30 segments, then it goes back and merges
 the new one plus the other 7 that remain.  To counteract this behavior, I've
 put this in my solrconfig.xml, to test next week.

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
 <int name="maxMergeAtOnceExplicit">70</int>
 </mergePolicy>

 I figure that twice the mergeFactor (35) will likely cover every possible
 outcome.  Is that a correct thought?

Actually, TieredMP has two different params (different from the
previous default LogMP):

  * segmentsPerTier controls how many segments you can tolerate in the
index (bigger number means more segments)

  * maxMergeAtOnce says how many segments can be merged at a time for
normal (not optimize) merging

For back-compat, mergeFactor maps to both of these, but it's better to
set them directly eg:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">20</int>
</mergePolicy>

(and then remove your mergeFactor setting under indexDefaults)

You should always have maxMergeAtOnce <= segmentsPerTier else too much
merging will happen.

If you set segmentsPerTier to 35 then this can easily exceed 70
segments, so your optimize will again need more than one merge.  Note
that if you make the maxMergeAtOnce/Explicit too large then 1) you
risk running out of file handles (if you don't use compound file), and
2) merge performance likely gets worse as the OS is forced to splinter
its IO cache across more files (I suspect) and so more seeking will
happen.

Mike McCandless

http://blog.mikemccandless.com


Re: Optimize taking two steps and extra disk space

2011-06-20 Thread Michael McCandless
On Mon, Jun 20, 2011 at 4:00 PM, Shawn Heisey s...@elyograg.org wrote:
 On 6/20/2011 12:31 PM, Michael McCandless wrote:

 Actually, TieredMP has two different params (different from the
 previous default LogMP):

   * segmentsPerTier controls how many segments you can tolerate in the
 index (bigger number means more segments)

   * maxMergeAtOnce says how many segments can be merged at a time for
 normal (not optimize) merging

 For back-compat, mergeFactor maps to both of these, but it's better to
 set them directly eg:

      <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
        <int name="maxMergeAtOnce">10</int>
        <int name="segmentsPerTier">20</int>
      </mergePolicy>

 (and then remove your mergeFactor setting under indexDefaults)

 You should always have maxMergeAtOnce <= segmentsPerTier else too much
 merging will happen.

 If you set segmentsPerTier to 35 then this can easily exceed 70
 segments, so your optimize will again need more than one merge.  Note
 that if you make the maxMergeAtOnce/Explicit too large then 1) you
 risk running out of file handles (if you don't use compound file), and
 2) merge performance likely gets worse as the OS is forced to splinter
 its IO cache across more files (I suspect) and so more seeking will
 happen.

 Thanks much for the information!

 I've set my server up so that the user running the index has a soft limit of
 4096 files and a hard limit of 6144 files, and /proc/sys/fs/file-max is
 48409, so I should be OK on file handles.  The index is almost twice as big
 as available memory, so I'm not really worried about the I/O cache.  I've
 sized my mergeFactor and ramBufferSizeMB so that the individual merges during
 indexing happen entirely from the I/O cache, which is the point where I
 really care about it.  There's nothing I can do about the optimize without
 spending a LOT of money.

 I will remove mergeFactor, set maxMergeAtOnce and segmentsPerTier to 35, and
 maxMergeAtOnceExplicit to 70.  If I ever run into a situation where it gets
 beyond 70 segments at any one time, I've probably got bigger problems than
 the number of passes my optimize takes, so I'll think about it then. :)
  Does that sound reasonable?

With segmentsPerTier at 35 you will easily cross 70 segs in the index...

If you want optimize to run in a single merge, I would lower
sementsPerTier and mergeAtOnce (maybe back to the 10 default), and set
your maxMergeAtOnceExplicit to 70 or higher...

Lower mergeAtOnce means merges run more frequently but for shorter
time, and, your searching should be faster (than 35/35) since there
are fewer segments to visit.

Mike McCandless

http://blog.mikemccandless.com


Re: Optimize taking two steps and extra disk space

2011-06-19 Thread Michael McCandless
With LogXMergePolicy (the default before 3.2), optimize respects
mergeFactor, so it's doing 2 steps because you have 37 segments but 35
mergeFactor.

With TieredMergePolicy (default on 3.2 and after), there is now a
separate merge factor used for optimize (maxMergeAtOnceExplicit)... so
you could eg set this factor higher and more often get a single merge
for the optimize.
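
For reference, these are the same knobs Lucene's TieredMergePolicy exposes
directly (a sketch assuming 3.2 APIs; analyzer is a placeholder, and in Solr
you set the same parameters inside the mergePolicy element of solrconfig.xml
instead):

  TieredMergePolicy tmp = new TieredMergePolicy();
  tmp.setMaxMergeAtOnce(10);             // segments merged at once during normal merging
  tmp.setSegmentsPerTier(10);            // how many similar-sized segments are tolerated per tier
  tmp.setMaxMergeAtOnceExplicit(70);     // segments merged at once during optimize/expungeDeletes
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_32, analyzer);
  iwc.setMergePolicy(tmp);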

Mike McCandless

http://blog.mikemccandless.com

On Sat, Jun 18, 2011 at 6:45 PM, Shawn Heisey s...@elyograg.org wrote:
 I've noticed something odd in Solr 3.2 when it does an optimize.  One of my
 shards (freshly built via DIH full-import) had 37 segments, totalling
 17.38GB of disk space.  13 of those segments were results of merges during
 initial import, the other 24 were untouched after creation.  Starting at _0,
 the final segment before optimizing is _co.  The mergefactor on the index is
 35, chosen because it makes merged segments line up nicely on z
 boundaries.

 The optmization process created a _cp segment of 14.4GB, followed by a _cq
 segment at the final 17.27GB size, so at the peak, it took 49GB of disk
 space to hold the index.

 Is there any way to make it do the optimize in one pass?  Is there a
 compelling reason why it does it this way?

 Thanks,
 Shawn




Re: Field Collapsing and Grouping in Solr 3.2

2011-06-16 Thread Michael McCandless
Alas, no, not yet.. grouping/field collapse has had a long history
with Solr.

There were many iterations on SOLR-236, but that impl was never
committed.  Instead, SOLR-1682 was committed, but committed only to
trunk (never backported to 3.x despite requests).

Then, a new grouping module was factored out of Solr's trunk
implementation, and was backported to 3.x.

Finally, there is now an effort to cut over Solr trunk (SOLR-2564) and
Solr 3.x (SOLR-2524) to the new grouping module, which looks like it's
close to being done!

So hopefully for 3.3, but no promises!  This is open-source...

Mike McCandless

http://blog.mikemccandless.com


2011/6/16 Sergio Martín sergio.mar...@playence.com

 Hello.



 Does anybody know if Field Collapsing and Grouping is available in Solr
 3.2.

 I mean directly available, not as a patch.



 I have read conflicting statements about it...



 Thanks a lot!






 *Sergio Martín Cantero*

 *playence KG*

 Penthouse office Soho II - Top 1

 Grabenweg 68

 6020 Innsbruck

 Austria

 Mobile: (+34)654464222

 eMail:  sergio.mar...@playence.com

 Web:www.playence.com






 Stay up to date on the latest developments of playence by subscribing to
 our blog (http://blog.playence.com) or following us in Twitter (
 http://twitter.com/playence).

 The information in this e-mail is confidential and may be legally
 privileged. It is intended solely for the addressee and access to the e-mail
 by anyone else is unauthorized. If you are not the intended recipient, any
 disclosure, copying, distribution or any action taken or omitted to be taken
 in reliance on it, is prohibited and may be unlawful. If you have received
 this e-mail in error please forward to off...@playence.com.  Thank you for
 your cooperation.




