Getting the offset of search keyword in a document

2010-07-23 Thread Ryan Chan
Hello,

I am new to Solr/Lucene and I am evaluating whether they suit my needs and
can replace our in-house system.


Our requirements:

1. I have multiple documents (1M)
2. Each document contains text ranging from a few KB to a few MB
3. I want to search for a keyword, search through all these documents,
and have it return the matched document(s), AND ALSO the offset of that
'keyword' inside the document.

Is it possible for requirement 3?
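For what it's worth, one common route to requirement 3 (a sketch, not from this thread; the field name is hypothetical) is to index term vectors with offsets and read them back through Solr's TermVectorComponent:

```xml
<!-- schema.xml: record character offsets for each indexed term -->
<field name="body" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

A request with tv=true&tv.offsets=true against a handler that includes the TermVectorComponent will then return start/end character offsets of each term per matching document.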


Re: Autocommit not happening

2010-07-23 Thread John DeRosa
I'll see you, and raise. My solrconfig.xml wasn't being copied to the server by 
the deployment script.

On Jul 23, 2010, at 3:26 PM, Jay Luker wrote:

> For the sake of any future googlers I'll report my own clueless but
> thankfully brief struggle with autocommit.
> 
> There are two parts to the story: Part One is where I realize my
> <autoCommit> config was not contained within my <updateHandler>. In
> Part Two I realized I had typed "<maxdocs>" rather than
> "<maxDocs>".
> 
> --jay
> 
> On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa  wrote:
>> On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:
>> 
>>> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
>>> happening in my Solr installation.
>>> 
>> 
>> [snip]
>> 
>> "Never mind"... I have discovered my boneheaded mistake. It's so silly, I 
>> wish I could retract my question from the archives.
>> 
>> 



SOLR Memory Usage - Where does it go?

2010-07-23 Thread Stephen Weiss
We have been having problems with SOLR on one project lately.  Forgive
me for writing a novel here, but it's really important that we identify
the root cause of this issue.  Solr is becoming unavailable at random
intervals, and the problem appears to be memory related.  There are
basically two ways it goes:


1) Straight up OOM error, either from Java or sometimes from the  
kernel itself.


2) Instead of throwing an OOM, the memory usage gets very high and  
then drops precipitously (say, from 92% (of 20GB) down to 60%).  Once  
the memory usage is done dropping, SOLR seems to stop responding to  
requests altogether.


It started out mostly being version #1 of the problem but now we're  
mostly seeing version #2 of the problem... and it's getting more and  
more frequent.  In either scenario the servlet container (Jetty) needs  
to be restarted to resume service.


The number of documents in the index is always going up.  They are  
relatively small in size (1K per piece max - mostly small numeric  
strings, with 5 text fields (one each for 5 languages) that are rarely  
more than 50-100 characters), and there are about 5 million of them at  
the moment (adding around 1000 every day).  The machine has 20 GB of  
RAM, Xmx is set to 18GB, and SOLR is the only thing this machine /  
servlet container does.  There are a couple other cores configured,
but they are minuscule in comparison (one with 20 docs, and two
more with < 1 docs apiece).  Eliminating these other cores does
not seem to make any significant impact.  This is with the SOLR 1.4.1
release, using the SOLR-236 patch that was recently released to go  
with this version.  The patch was slightly modified in order to ensure  
that paging continued to work properly  - basically, an optimization  
that eliminated paging was removed per the instructions in this comment:


https://issues.apache.org/jira/browse/SOLR-236?focusedCommentId=12867680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12867680


I realize this is not ideal if you want to control memory usage, but  
the design requirements of the project preclude us from eliminating  
either collapsing or paging.  It's also probably worth noting that  
these problems did not start with version 1.4.1 or this version of the  
236 patch - we actually upgraded from 1.4 because they said it fixed  
some memory leaks, hoping it would help solve this problem.


We have some test machines set up and we have been testing out various  
configuration changes.  Watching the stats in the admin area, this is  
what we've been able to figure out:


1) The fieldValueCache usage stays constant at 23 entries (one for  
each faceted field), and takes up a total size of about 750MB  
altogether.


2) Lowering or just eliminating the filterCache and the  
queryResultCache does not seem to have any serious impact - perhaps a  
difference of a few percent at the start, but after prolonged usage  
the memory still goes up seemingly uncontrolled.  It would appear the  
queryResultCache does not get much usage anyway, and even though we  
have higher eviction rates in the filterCache, this really doesn't  
seem to impact performance significantly.


3) Lowering or eliminating the documentCache also doesn't seem to have  
very much impact in memory usage, although it does make searches much  
slower.


4) We followed the instructions for configuring the HashDocSet  
parameter, but this doesn't seem to be having much impact either.


5)  All the caches, with the exception of the documentCache, are  
FastLRUCaches.  Switching between FastLRUCache and normal LRUCache in  
general doesn't seem to change the memory usage.


6) Glancing through all of the data on memory usage in the Lucene  
fieldCache would indicate that this cache is using well under 1GB of  
RAM as well.


Basically, when the servlet first starts, it uses very little RAM  
(<4%).  We warm the searcher with a few standard queries that  
initialize everything in the fieldValueCache off the bat, and the  
query performance levels off at a reasonable speed, with memory usage  
around 10-12%.  At this point, almost all queries execute within a few  
100ms, if not faster.  A very few queries that return large numbers of  
collapsed documents, generally 800K up to about 2 million (we have  
about 5 distinct queries that do this), will take up to 20 seconds to  
run the first time, and up to 10 seconds thereafter.  Even after  
running all these queries, memory usage stays around 20-30%.  At this  
point, performance is optimal.  We simulate production usage, running  
queries taken from those logs through the system at a rate similar to  
production use.


For the most part, memory usage stays level.  Usage will go up as  
queries are run (this seems to correspond with when they are being  
collapsed), but then go back down as the results are returned.  Then,  
over the course of a few hours, at seemingly random intervals

Re: help with a schema design problem

2010-07-23 Thread Chris Hostetter
: > Is there any way in solr to say p_value[someIndex]="pramod"
: And p_type[someIndex]="client".
: No, I'm 99% sure there is not.

it's possible in code, by utilizing positions and FieldMaskingSpanQuery... 
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html

...but there is no QParser or RequestHandler with syntax for exposing it 
to clients.  it would have to be a custom plugin.


-Hoss
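A rough sketch of what such custom code could look like, using the thread's p_value / p_type fields (this assumes the two fields are indexed with aligned positions; Lucene 2.9 API, untested fragment):

```java
// Same-position match: "pramod" in p_value AND "client" in p_type.
SpanQuery value  = new SpanTermQuery(new Term("p_value", "pramod"));
SpanQuery type   = new SpanTermQuery(new Term("p_type", "client"));
// Mask p_type so both clauses appear to come from the same field:
SpanQuery masked = new FieldMaskingSpanQuery(type, "p_value");
// Slop -1 with inOrder=false requires the clauses at the same position.
SpanQuery query  = new SpanNearQuery(new SpanQuery[] { value, masked }, -1, false);
```

This mirrors the example in the FieldMaskingSpanQuery javadoc linked above.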



Re: Autocommit not happening

2010-07-23 Thread Jay Luker
For the sake of any future googlers I'll report my own clueless but
thankfully brief struggle with autocommit.

There are two parts to the story: Part One is where I realize my
<autoCommit> config was not contained within my <updateHandler>. In
Part Two I realized I had typed "<maxdocs>" rather than
"<maxDocs>".

--jay

On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa  wrote:
> On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:
>
>> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
>> happening in my Solr installation.
>>
>
> [snip]
>
> "Never mind"... I have discovered my boneheaded mistake. It's so silly, I 
> wish I could retract my question from the archives.
>
>
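For future googlers, a known-good autoCommit block (values illustrative) sits inside <updateHandler> in solrconfig.xml:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```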


Re: Performance issues when querying on large documents

2010-07-23 Thread Alexey Serba
Do you use highlighting? ( http://wiki.apache.org/solr/HighlightingParameters )

Try to disable it and compare performance.
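If highlighting does turn out to be the bottleneck but can't be dropped, a standard mitigation (a sketch; field name hypothetical) is to store term vectors so the highlighter doesn't re-analyze multi-megabyte fields on every query:

```xml
<field name="pdf_text" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```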

On Fri, Jul 23, 2010 at 10:52 PM, ahammad  wrote:
>
> Hello,
>
> I have an index with lots of different types of documents. One of those
> types basically contains extracts of PDF docs. Some of those PDFs can have
> 1000+ pages, so there would be a lot of stuff to search through.
>
> I am experiencing really terrible performance when querying. My whole index
> has about 270k documents, but less than 1000 of those are the PDF extracts.
> The slow querying occurs when I search only on those PDF extracts (by
> specifying filters), and return 100 results. The 100 results definitely add
> to the issue, but even cutting that down can be slow.
>
> Is there a way to improve querying with such large results? To give an idea,
> querying for a single word can take a little over a minute, which isn't
> really viable for an application that revolves around searching. For now, I
> have limited the results to 20, which makes the query execute in roughly
> 10-15 seconds. However, I would like to have the option of returning 100
> results.
>
> Thanks a lot.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: commit is taking very very long time

2010-07-23 Thread Mark Miller
On 7/23/10 5:59 PM, Alexey Serba wrote:

> Another option is to set optimize=false in DIH call ( it's true by
> default ). 

Ouch - that should really be changed then.

- Mark


Re: 2 solr dataImport requests on a single core at the same time

2010-07-23 Thread Alexey Serba
> having multiple Request Handlers will not degrade the performance
IMO you shouldn't worry unless you have hundreds of them


Re: commit is taking very very long time

2010-07-23 Thread Alexey Serba
> I am not sure why some commits take very long time.
Hmm... Because it merges index segments... How large is your index?

> Also is there a way to reduce the time it takes?
You can disable commit in the DIH call and use autoCommit instead. It's
kind of a hack because you postpone the commit operation and make it async.

Another option is to set optimize=false in DIH call ( it's true by
default ). Also you can try to increase mergeFactor parameter but it
would affect search performance.
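Both knobs go on the DIH request itself; for example (host and core names illustrative):

```
http://localhost:8983/solr/dataimport?command=full-import&commit=false&optimize=false
```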


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
Multiple rows in the OP's example are combined to form 1 solr-document (e.g.
rows 1 and 2 both have documentid=1).
Because of this combining, it would match p_value from row 1 with p_type from
row 2 (or vice versa).


2010/7/23 Nagelberg, Kallin 

> > > > When i search
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > it would give me result as document 1. Which is incorrect, since in
> > > > document
> > > > 1 Pramod is a Client and not a Supplier.
>
> Would it? I would expect it to give you nothing.
>
> -Kal
>
>
>
> -Original Message-
> From: Geert-Jan Brits [mailto:gbr...@gmail.com]
> Sent: Friday, July 23, 2010 5:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: help with a schema design problem
>
> > Is there any way in solr to say p_value[someIndex]="pramod"
> And p_type[someIndex]="client".
> No, I'm 99% sure there is not.
>
> > One way would be to define a single field in the schema as p_value_type =
> "client pramod" i.e. combine the value from both the field and store it in
> a
> single field.
> yep, for the use-case you mentioned that would definitely work. Multivalued
> of course, so it can contain "Supplier Raj" as well.
>
>
> 2010/7/23 Pramod Goyal 
>
> >In my case the document id is the unique key( each row is not a unique
> > document ) . So a single document has multiple Party Value and Party
> Type.
> > Hence i need to define both Party value and Party type as mutli-valued.
> Is
> > there any way in solr to say p_value[someIndex]="pramod" And
> > p_type[someIndex]="client".
> >Is there any other way i can design my schema ? I have some solutions
> > but none seems to be a good solution. One way would be to define a single
> > field in the schema as p_value_type = "client pramod" i.e. combine the
> > value
> > from both the field and store it in a single field.
> >
> >
> > On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits 
> > wrote:
> >
> > > With the usecase you specified it should work to just index each "Row"
> as
> > > you described in your initial post to be a separate document.
> > > This way p_value and p_type all get singlevalued and you get a correct
> > > combination of p_value and p_type.
> > >
> > > However, this may not go so well with other use-cases you have in mind,
> > > e.g.: requiring that no multiple results are returned with the same
> > > document
> > > id.
> > >
> > >
> > >
> > > 2010/7/23 Pramod Goyal 
> > >
> > > > I want to do that. But if i understand correctly in solr it would
> store
> > > the
> > > > field like this:
> > > >
> > > > p_value: "Pramod"  "Raj"
> > > > p_type:  "Client" "Supplier"
> > > >
> > > > When i search
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > it would give me result as document 1. Which is incorrect, since in
> > > > document
> > > > 1 Pramod is a Client and not a Supplier.
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > > knagelb...@globeandmail.com> wrote:
> > > >
> > > > > I think you just want something like:
> > > > >
> > > > > p_value:"Pramod" AND p_type:"Supplier"
> > > > >
> > > > > no?
> > > > > -Kallin Nagelberg
> > > > >
> > > > > -Original Message-
> > > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: help with a schema design problem
> > > > >
> > > > > Hi,
> > > > >
> > > > > Lets say i have table with 3 columns document id Party Value and
> > Party
> > > > > Type.
> > > > > In this table i have 3 rows. 1st row Document id: 1 Party Value:
> > Pramod
> > > > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> > > Type:
> > > > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> > > Supplier.
> > > > > Now in this table if i use SQL its easy for me find all document
> with
> > > > Party
> > > > > Value as Pramod and Party Type as Client.
> > > > >
> > > > > I need to design solr schema so that i can do the same in Solr. If
> i
> > > > create
> > > > > 2 fields in solr schema Party value and Party type both of them
> multi
> > > > > valued
> > > > > and try to query +Pramod +Supplier then solr will return me the
> first
> > > > > document, even though in the first document Pramod is a client and
> > not
> > > a
> > > > > supplier
> > > > > Thanks,
> > > > > Pramod Goyal
> > > > >
> > > >
> > >
> >
>


RE: Novice seeking help to change filters to search without diacritics

2010-07-23 Thread HSingh

Hi Steve,  This is extremely helpful!  What is the best way to also
preserve/append the diacritics in the index in case someone searches using
them?  I deeply appreciate your help!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Novice-seeking-help-to-change-filters-to-search-without-diacritics-tp971263p990949.html
Sent from the Solr - User mailing list archive at Nabble.com.
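One way to keep both behaviors (a sketch; field and type names hypothetical, using the ASCIIFoldingFilterFactory available in Solr 1.4) is to index the text twice, once as-is and once folded, and search across both fields:

```xml
<!-- schema.xml -->
<field name="title"        type="text"        indexed="true" stored="true"/>
<field name="title_folded" type="text_folded" indexed="true" stored="false"/>
<copyField source="title" dest="title_folded"/>
<!-- text_folded's analyzer chain would include:
     <filter class="solr.ASCIIFoldingFilterFactory"/> -->
```

Queries with diacritics match the original field; queries without them match the folded copy.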


Re: filter query on timestamp slowing query???

2010-07-23 Thread Geert-Jan Brits
just wanted to mention a possible other route, which might be entirely
hypothetical :-)

*If* you could query on internal docid (I'm not sure that it's available
out-of-the-box, or if you can at all)
your original problem, quoted below, could imo be simplified to asking for
the last docid inserted (that matches the other criteria from your use-case)
and in the next call filtering from that docid forward.

>Every 30 minutes, i ask the index what are the documents that were added to
>it, since the last time i queried it, that match a certain criteria.
>From time to time, once a week or so, i ask the index for ALL the documents
>that match that criteria. (i also do this for not only one query, but
>several)
>This is why i need the timestamp filter.

Again, I'm not entirely sure that querying / filtering on internal docids is
possible (perhaps someone can comment) but if it is, it would perhaps be
more performant.
Big IF, I know.

Geert-Jan

2010/7/23 Chris Hostetter 

> : On top of using trie dates, you might consider separating the timestamp
> : portion and the type portion of the fq into separate fq parameters --
> : that will allow them to be stored in the filter cache separately. So
> : for instance, if you include "type:x OR type:y" in queries a lot, but
> : with different date ranges, then when you make a new query, the set for
> : "type:x OR type:y" can be pulled from the filter cache and intersected
>
> definitely ... that's the one big thing that jumped out at me once you
> showed us *how* you were constructing these queries.
>
>
>
> -Hoss
>
>


RE: help with a schema design problem

2010-07-23 Thread Nagelberg, Kallin
> > > When i search
> > > p_value:"Pramod" AND p_type:"Supplier"
> > >
> > > it would give me result as document 1. Which is incorrect, since in
> > > document
> > > 1 Pramod is a Client and not a Supplier.

Would it? I would expect it to give you nothing.

-Kal



-Original Message-
From: Geert-Jan Brits [mailto:gbr...@gmail.com] 
Sent: Friday, July 23, 2010 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: help with a schema design problem

> Is there any way in solr to say p_value[someIndex]="pramod"
And p_type[someIndex]="client".
No, I'm 99% sure there is not.

> One way would be to define a single field in the schema as p_value_type =
"client pramod" i.e. combine the value from both the field and store it in a
single field.
yep, for the use-case you mentioned that would definitely work. Multivalued
of course, so it can contain "Supplier Raj" as well.
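A sketch of that combined-field workaround with the thread's data (syntax illustrative, Solr 1.4):

```xml
<!-- schema.xml: one multiValued string field holding "type value" pairs -->
<field name="p_value_type" type="string" indexed="true" stored="true"
       multiValued="true"/>

<!-- document 1 -->
<doc>
  <field name="documentid">1</field>
  <field name="p_value_type">client pramod</field>
  <field name="p_value_type">supplier raj</field>
</doc>
```

A query like p_value_type:"client pramod" then only matches documents where that exact pairing exists.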


2010/7/23 Pramod Goyal 

>In my case the document id is the unique key( each row is not a unique
> document ) . So a single document has multiple Party Value and Party Type.
> Hence i need to define both Party value and Party type as multi-valued. Is
> there any way in solr to say p_value[someIndex]="pramod" And
> p_type[someIndex]="client".
>Is there any other way i can design my schema ? I have some solutions
> but none seems to be a good solution. One way would be to define a single
> field in the schema as p_value_type = "client pramod" i.e. combine the
> value
> from both the field and store it in a single field.
>
>
> On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits 
> wrote:
>
> > With the usecase you specified it should work to just index each "Row" as
> > you described in your initial post to be a separate document.
> > This way p_value and p_type all get singlevalued and you get a correct
> > combination of p_value and p_type.
> >
> > However, this may not go so well with other use-cases you have in mind,
> > e.g.: requiring that no multiple results are returned with the same
> > document
> > id.
> >
> >
> >
> > 2010/7/23 Pramod Goyal 
> >
> > > I want to do that. But if i understand correctly in solr it would store
> > the
> > > field like this:
> > >
> > > p_value: "Pramod"  "Raj"
> > > p_type:  "Client" "Supplier"
> > >
> > > When i search
> > > p_value:"Pramod" AND p_type:"Supplier"
> > >
> > > it would give me result as document 1. Which is incorrect, since in
> > > document
> > > 1 Pramod is a Client and not a Supplier.
> > >
> > >
> > >
> > >
> > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > knagelb...@globeandmail.com> wrote:
> > >
> > > > I think you just want something like:
> > > >
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > no?
> > > > -Kallin Nagelberg
> > > >
> > > > -Original Message-
> > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: help with a schema design problem
> > > >
> > > > Hi,
> > > >
> > > > Lets say i have table with 3 columns document id Party Value and
> Party
> > > > Type.
> > > > In this table i have 3 rows. 1st row Document id: 1 Party Value:
> Pramod
> > > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> > Type:
> > > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> > Supplier.
> > > > Now in this table if i use SQL its easy for me find all document with
> > > Party
> > > > Value as Pramod and Party Type as Client.
> > > >
> > > > I need to design solr schema so that i can do the same in Solr. If i
> > > create
> > > > 2 fields in solr schema Party value and Party type both of them multi
> > > > valued
> > > > and try to query +Pramod +Supplier then solr will return me the first
> > > > document, even though in the first document Pramod is a client and
> not
> > a
> > > > supplier
> > > > Thanks,
> > > > Pramod Goyal
> > > >
> > >
> >
>


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
> Is there any way in solr to say p_value[someIndex]="pramod"
And p_type[someIndex]="client".
No, I'm 99% sure there is not.

> One way would be to define a single field in the schema as p_value_type =
"client pramod" i.e. combine the value from both the field and store it in a
single field.
yep, for the use-case you mentioned that would definitely work. Multivalued
of course, so it can contain "Supplier Raj" as well.


2010/7/23 Pramod Goyal 

>In my case the document id is the unique key( each row is not a unique
> document ) . So a single document has multiple Party Value and Party Type.
> Hence i need to define both Party value and Party type as multi-valued. Is
> there any way in solr to say p_value[someIndex]="pramod" And
> p_type[someIndex]="client".
>Is there any other way i can design my schema ? I have some solutions
> but none seems to be a good solution. One way would be to define a single
> field in the schema as p_value_type = "client pramod" i.e. combine the
> value
> from both the field and store it in a single field.
>
>
> On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits 
> wrote:
>
> > With the usecase you specified it should work to just index each "Row" as
> > you described in your initial post to be a separate document.
> > This way p_value and p_type all get singlevalued and you get a correct
> > combination of p_value and p_type.
> >
> > However, this may not go so well with other use-cases you have in mind,
> > e.g.: requiring that no multiple results are returned with the same
> > document
> > id.
> >
> >
> >
> > 2010/7/23 Pramod Goyal 
> >
> > > I want to do that. But if i understand correctly in solr it would store
> > the
> > > field like this:
> > >
> > > p_value: "Pramod"  "Raj"
> > > p_type:  "Client" "Supplier"
> > >
> > > When i search
> > > p_value:"Pramod" AND p_type:"Supplier"
> > >
> > > it would give me result as document 1. Which is incorrect, since in
> > > document
> > > 1 Pramod is a Client and not a Supplier.
> > >
> > >
> > >
> > >
> > > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > > knagelb...@globeandmail.com> wrote:
> > >
> > > > I think you just want something like:
> > > >
> > > > p_value:"Pramod" AND p_type:"Supplier"
> > > >
> > > > no?
> > > > -Kallin Nagelberg
> > > >
> > > > -Original Message-
> > > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > > Sent: Friday, July 23, 2010 2:17 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: help with a schema design problem
> > > >
> > > > Hi,
> > > >
> > > > Lets say i have table with 3 columns document id Party Value and
> Party
> > > > Type.
> > > > In this table i have 3 rows. 1st row Document id: 1 Party Value:
> Pramod
> > > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> > Type:
> > > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> > Supplier.
> > > > Now in this table if i use SQL its easy for me find all document with
> > > Party
> > > > Value as Pramod and Party Type as Client.
> > > >
> > > > I need to design solr schema so that i can do the same in Solr. If i
> > > create
> > > > 2 fields in solr schema Party value and Party type both of them multi
> > > > valued
> > > > and try to query +Pramod +Supplier then solr will return me the first
> > > > document, even though in the first document Pramod is a client and
> not
> > a
> > > > supplier
> > > > Thanks,
> > > > Pramod Goyal
> > > >
> > >
> >
>


RE: filter query on timestamp slowing query???

2010-07-23 Thread Chris Hostetter
: On top of using trie dates, you might consider separating the timestamp 
: portion and the type portion of the fq into separate fq parameters -- 
: that will allow them to be stored in the filter cache separately. So 
: for instance, if you include "type:x OR type:y" in queries a lot, but 
: with different date ranges, then when you make a new query, the set for 
: "type:x OR type:y" can be pulled from the filter cache and intersected 

definitely ... that's the one big thing that jumped out at me once you 
showed us *how* you were constructing these queries.  



-Hoss
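Concretely, the suggestion amounts to splitting one combined fq into two separately cacheable ones (values illustrative):

```
before:  fq=(type:x OR type:y) AND timestamp:[NOW/DAY-7DAYS TO NOW]
after:   fq=type:x OR type:y
         fq=timestamp:[NOW/DAY-7DAYS TO NOW]
```

The type clause is now cached once and reused across queries whose date range differs.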



Scoring Search for autocomplete

2010-07-23 Thread Frank A
Hi, I have an autocomplete that is currently working with an
NGramTokenizer so if I search for "Yo" both "New York" and "Toyota"
are valid results.  However I'm trying to figure out how to best
implement the search so that from a score perspective if the string
matches the beginning of an entire field it ranks first, followed by
the beginning of a term and then in the middle of a term.  For example
if I was searching with "vi" I would want Virginia ahead of West
Virginia ahead of Five.

I think I can do this with three separate fields, one using a white
space tokenizer and a ngram filter, another using the edge-ngram +
whitespace and another using keyword+edge-ngram, then doing an or on
the 3 fields, so that Virginia would match all 3 and get a higher
score... but this doesn't feel right to me, so I wanted to check for
better options.

Thanks.
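The intended ordering can be sanity-checked outside Solr with a small standalone toy (the 3/2/1 weights are hypothetical stand-ins for per-field boosts, not Solr's actual scoring):

```java
import java.util.*;

public class AutocompletePreview {
    // Toy stand-ins for the three analyzed fields:
    //   +3  keyword + edge-ngram    (whole field starts with the prefix)
    //   +2  whitespace + edge-ngram (some term starts with the prefix)
    //   +1  whitespace + ngram      (prefix occurs anywhere in a term)
    static int score(String prefix, String value) {
        String q = prefix.toLowerCase();
        String v = value.toLowerCase();
        int s = 0;
        if (v.startsWith(q)) s += 3;
        boolean termStart = false, termAny = false;
        for (String t : v.split("\\s+")) {
            termStart |= t.startsWith(q);
            termAny |= t.contains(q);
        }
        if (termStart) s += 2;
        if (termAny) s += 1;
        return s;
    }

    public static void main(String[] args) {
        List<String> hits = new ArrayList<>(Arrays.asList("Five", "West Virginia", "Virginia"));
        // Sort descending by score for the prefix "vi":
        hits.sort((a, b) -> score("vi", b) - score("vi", a));
        System.out.println(hits); // [Virginia, West Virginia, Five]
    }
}
```

Virginia matches all three "fields" (score 6), West Virginia only the term-level ones (3), and Five only the inner-ngram one (1), which is exactly the ranking asked for.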


Re: Sort by index order desc?

2010-07-23 Thread Ryan McKinley
Looks like you can sort by _docid_ to get things in index order or
reverse index order.

?sort=_docid_ asc

thank you solr!
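And for the reverse order the thread title asks about (request shape illustrative):

```
http://localhost:8983/solr/select?q=*:*&sort=_docid_ desc
```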


On Fri, Jul 23, 2010 at 2:23 PM, Ryan McKinley  wrote:
> Any pointers on how to sort by reverse index order?
> http://search.lucidimagination.com/search/document/4a59ded3966271ca/sort_by_index_order_desc
>
> it seems like it should be easy to do with the function query stuff,
> but i'm not sure what to sort by (unless I add a new field for indexed
> time)
>
>
> Any pointers?
>
> Thanks
> Ryan
>


Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD

2010-07-23 Thread Sharp, Jonathan

Are you using the same instance of CommonsHttpSolrServer for all the
requests?


I was.

I also tried creating a new instance every x requests, also resetting  
the credentials on the new instances, to see if it would make a  
difference.


Doing that, I get an exception after several instances of the  
httpserver (again several hundred PDFs) to the effect that the socket  
is still in use... Perhaps I am not releasing the resources properly...?


-Jon
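One thing worth checking for the "socket still in use" symptom (a sketch, assuming old CommonsHttpSolrServer instances are simply dropped; commons-httpclient 3.x API):

```java
// Before discarding an old instance, release its pooled connections:
HttpClient client = ((CommonsHttpSolrServer) oldServer).getHttpClient();
client.getHttpConnectionManager().closeIdleConnections(0L);
```

Reusing a single server instance, as asked about below, avoids the issue entirely.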

On Jul 22, 2010, at 3:02 AM, "Bilgin Ibryam"  wrote:


Are you using the same instance of CommonsHttpSolrServer for all the
requests?

On Wed, Jul 21, 2010 at 4:50 PM, Sharp, Jonathan   
wrote:




Some further information --

I tried indexing a batch of PDFs with the client and Solr CELL, setting
the credentials in the httpclient. For some reason after successfully
indexing several hundred files I start getting a "SolrException:
Unauthorized" and an info message (for every subsequent file):

INFO basic authentication scheme selected
org.apache.commons.httpclient.HttpMethodDirector processWWWAuthChallenge
INFO Failure authenticating with BASIC ''@host:port

I increased session timeout in web.xml with no change. I'm looking
through the httpclient authentication now.

-Jon

-Original Message-
From: Sharp, Jonathan
Sent: Friday, July 16, 2010 8:59 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Securing Solr 1.4 in a glassfish container AS NEW THREAD

Hi Bilgin,

Thanks for the snippet -- that helps a lot.

-Jon

-Original Message-
From: Bilgin Ibryam [mailto:bibr...@gmail.com]
Sent: Friday, July 16, 2010 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD

Hi Jon,

SolrJ (CommonsHttpSolrServer) internally uses Apache HttpClient to connect
to Solr. You can check there for some documentation.
I secured Solr also with the BASIC auth-method and use the following snippet
to access it from SolrJ:

// set username and password
((CommonsHttpSolrServer) server).getHttpClient()
    .getParams().setAuthenticationPreemptive(true);

Credentials defaultcreds =
    new UsernamePasswordCredentials("username", "secret");
((CommonsHttpSolrServer) server).getHttpClient().getState()
    .setCredentials(new AuthScope("localhost", 80, AuthScope.ANY_REALM),
                    defaultcreds);

HTH
Bilgin Ibryam



On Fri, Jul 16, 2010 at 2:35 AM, Sharp, Jonathan   
wrote:



Hi All,

I am considering securing Solr with basic auth in glassfish using the
container, by adding to web.xml and adding a sun-web.xml file to the
distributed WAR as below.

If using SolrJ to index files, how can I provide the credentials for
authentication to the http-client (or can someone point me in the direction
of the right documentation that will help me make the appropriate
modifications)?

Also any comment on the below is appreciated.

Add this to web.xml
---
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>SomeRealm</realm-name>
  </login-config>

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Admin Pages</web-resource-name>
      <url-pattern>/admin</url-pattern>
      <url-pattern>/admin/*</url-pattern>
      <http-method>GET</http-method>
      <http-method>POST</http-method>
      <http-method>PUT</http-method>
      <http-method>TRACE</http-method>
      <http-method>HEAD</http-method>
      <http-method>OPTIONS</http-method>
      <http-method>DELETE</http-method>
    </web-resource-collection>
    <auth-constraint>
      <role-name>SomeAdminRole</role-name>
    </auth-constraint>
  </security-constraint>

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Update Servlet</web-resource-name>
      <url-pattern>/update/*</url-pattern>
      <http-method>GET</http-method>
      <http-method>POST</http-method>
      <http-method>PUT</http-method>
      <http-method>TRACE</http-method>
      <http-method>HEAD</http-method>
      <http-method>OPTIONS</http-method>
      <http-method>DELETE</http-method>
    </web-resource-collection>
    <auth-constraint>
      <role-name>SomeUpdateRole</role-name>
    </auth-constraint>
  </security-constraint>

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Select Servlet</web-resource-name>
      <url-pattern>/select/*</url-pattern>
      <http-method>GET</http-method>
      <http-method>POST</http-method>
      <http-method>PUT</http-method>
      <http-method>TRACE</http-method>
      <http-method>HEAD</http-method>
      <http-method>OPTIONS</http-method>
      <http-method>DELETE</http-method>
    </web-resource-collection>
    <auth-constraint>
      <role-name>SomeSearchRole</role-name>
    </auth-constraint>
  </security-constraint>
---

Also add this as sun-web.xml




<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application
Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
<sun-web-app>
  <context-root>/Solr</context-root>

  <jsp-config>
    <property name="keepgenerated" value="true">
      <description>Keep a copy of the generated servlet class' java code.</description>
    </property>
  </jsp-config>

  <security-role-mapping>
    <role-name>SomeAdminRole</role-name>
    <group-name>SomeAdminGroup</group-name>
  </security-role-mapping>
  <security-role-mapping>
    <role-name>SomeUpdateRole</role-name>
    <group-name>SomeUpdateGroup</group-name>
  </security-role-mapping>
  <security-role-mapping>
    <role-name>SomeSearchRole</role-name>
    <group-name>SomeSearchGroup</group-name>
  </security-role-mapping>
</sun-web-app>


--

-Jon



Re: a bug of solr distributed search

2010-07-23 Thread Yonik Seeley
On Fri, Jul 23, 2010 at 2:40 PM, MitchK  wrote:
> That only works if the docs are exactly the same - they may not be.
> Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same,
> shouldn't they?

Documents aren't supposed to be duplicated across shards... so the
presence of multiple docs with the same id is a bug anyway.  We've
chosen to try and handle it gracefully rather than fail hard.

Some people have treated this as a feature - and that's OK as long as
expectations are set appropriately.

-Yonik
http://www.lucidimagination.com


Re: help with a schema design problem

2010-07-23 Thread Pramod Goyal
In my case the document id is the unique key (each row is not a unique
document). So a single document has multiple Party Values and Party Types.
Hence I need to define both Party Value and Party Type as multi-valued. Is
there any way in Solr to say p_value[someIndex]="pramod" AND
p_type[someIndex]="client"?
    Is there any other way I can design my schema? I have some solutions,
but none seems to be a good one. One way would be to define a single field
in the schema as p_value_type = "client pramod", i.e. combine the values
from both fields and store them in a single field.
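
A quick sketch of how that combined-field workaround behaves (the field name
p_value_type and the sample documents are illustrative assumptions, not an
actual schema):

```python
# Sketch: emulate the "combined field" workaround for keeping
# multi-valued p_value / p_type entries paired per document.
# Field and document names here are illustrative assumptions.

def build_combined(parties):
    """Join each (type, value) pair into one indexable entry."""
    return [f"{p_type} {p_value}" for p_type, p_value in parties]

def matches(doc, p_type, p_value):
    """A doc matches only if the *same* entry carries both type and value."""
    return f"{p_type} {p_value}" in doc["p_value_type"]

docs = [
    {"id": 1, "p_value_type": build_combined([("client", "pramod"),
                                              ("supplier", "raj")])},
    {"id": 2, "p_value_type": build_combined([("supplier", "pramod")])},
]

# "pramod AND supplier" now only hits doc 2, not doc 1:
hits = [d["id"] for d in docs if matches(d, "supplier", "pramod")]
print(hits)  # [2]
```

In Solr the same effect would depend on indexing the pair so it matches as a
unit, e.g. querying the combined field as a phrase.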


On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits  wrote:

> With the usecase you specified it should work to just index each "Row" as
> you described in your initial post to be a seperate document.
> This way p_value and p_type all get singlevalued and you get a correct
> combination of p_value and p_type.
>
> However, this may not go so well with other use-cases you have in mind,
> e.g.: requiring that no multiple results are returned with the same
> document
> id.
>
>
>
> 2010/7/23 Pramod Goyal 
>
> > I want to do that. But if i understand correctly in solr it would store
> the
> > field like this:
> >
> > p_value: "Pramod"  "Raj"
> > p_type:  "Client" "Supplier"
> >
> > When i search
> > p_value:"Pramod" AND p_type:"Supplier"
> >
> > it would give me result as document 1. Which is incorrect, since in
> > document
> > 1 Pramod is a Client and not a Supplier.
> >
> >
> >
> >
> > On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> > knagelb...@globeandmail.com> wrote:
> >
> > > I think you just want something like:
> > >
> > > p_value:"Pramod" AND p_type:"Supplier"
> > >
> > > no?
> > > -Kallin Nagelberg
> > >
> > > -Original Message-
> > > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > > Sent: Friday, July 23, 2010 2:17 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: help with a schema design problem
> > >
> > > Hi,
> > >
> > > Lets say i have table with 3 columns document id Party Value and Party
> > > Type.
> > > In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
> > > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party
> Type:
> > > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type:
> Supplier.
> > > Now in this table if i use SQL its easy for me find all document with
> > Party
> > > Value as Pramod and Party Type as Client.
> > >
> > > I need to design solr schema so that i can do the same in Solr. If i
> > create
> > > 2 fields in solr schema Party value and Party type both of them multi
> > > valued
> > > and try to query +Pramod +Supplier then solr will return me the first
> > > document, even though in the first document Pramod is a client and not
> a
> > > supplier
> > > Thanks,
> > > Pramod Goyal
> > >
> >
>


Performance issues when querying on large documents

2010-07-23 Thread ahammad

Hello,

I have an index with lots of different types of documents. One of those
types basically contains extracts of PDF docs. Some of those PDFs can have
1000+ pages, so there would be a lot of stuff to search through.

I am experiencing really terrible performance when querying. My whole index
has about 270k documents, but less than 1000 of those are the PDF extracts.
The slow querying occurs when I search only on those PDF extracts (by
specifying filters), and return 100 results. Returning 100 results definitely
adds to the issue, but even cutting that down can be slow.

Is there a way to improve querying with such large results? To give an idea,
querying for a single word can take a little over a minute, which isn't
really viable for an application that revolves around searching. For now, I
have limited the results to 20, which makes the query execute in roughly
10-15 seconds. However, I would like to have the option of returning 100
results.

Thanks a lot.

 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Performance-issues-when-querying-on-large-documents-tp990590p990590.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: help with a schema design problem

2010-07-23 Thread Geert-Jan Brits
With the usecase you specified it should work to just index each "Row" as
you described in your initial post to be a seperate document.
This way p_value and p_type all get singlevalued and you get a correct
combination of p_value and p_type.

However, this may not go so well with other use-cases you have in mind,
e.g.: requiring that no multiple results are returned with the same document
id.



2010/7/23 Pramod Goyal 

> I want to do that. But if i understand correctly in solr it would store the
> field like this:
>
> p_value: "Pramod"  "Raj"
> p_type:  "Client" "Supplier"
>
> When i search
> p_value:"Pramod" AND p_type:"Supplier"
>
> it would give me result as document 1. Which is incorrect, since in
> document
> 1 Pramod is a Client and not a Supplier.
>
>
>
>
> On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
> knagelb...@globeandmail.com> wrote:
>
> > I think you just want something like:
> >
> > p_value:"Pramod" AND p_type:"Supplier"
> >
> > no?
> > -Kallin Nagelberg
> >
> > -Original Message-
> > From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> > Sent: Friday, July 23, 2010 2:17 PM
> > To: solr-user@lucene.apache.org
> > Subject: help with a schema design problem
> >
> > Hi,
> >
> > Lets say i have table with 3 columns document id Party Value and Party
> > Type.
> > In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
> > Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type:
> > Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier.
> > Now in this table if i use SQL its easy for me find all document with
> Party
> > Value as Pramod and Party Type as Client.
> >
> > I need to design solr schema so that i can do the same in Solr. If i
> create
> > 2 fields in solr schema Party value and Party type both of them multi
> > valued
> > and try to query +Pramod +Supplier then solr will return me the first
> > document, even though in the first document Pramod is a client and not a
> > supplier
> > Thanks,
> > Pramod Goyal
> >
>


Re: a bug of solr distributed search

2010-07-23 Thread MitchK


That only works if the docs are exactly the same - they may not be. 
Ahm, what? Why? If the uniqueID is the same, the docs *should* be the same,
shouldn't they?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990563.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: a bug of solr distributed search

2010-07-23 Thread MitchK

... Additionally to my previous posting:
To keep this in sync, we could do two things:
Wait for every server, to make sure that everyone uses the same values to
compute the score, and then apply them.
Or: let's say we collect the new values every 15 minutes. To merge them and
send them over the network, we declare that this will need 3 additional
minutes (we want to keep the network traffic for such actions very low, so
we do not send everything instantly).
Okay, and now we add 2 more minutes, in case 3 were not enough or something
needs a little more time than we thought. After those 2 minutes, every node
has to apply the new values.
Pro: if one node breaks, we do not delay the application of the new values.
Con: we need two HashMaps, and both will have roughly the same size. That
means we will waste some RAM for this operation, if we do not write the
values to disk (which I do not suggest).
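
The reduce-and-distribute step I have in mind could be sketched like this (a
toy illustration, not the actual patch; the shard data and the classic
1 + log(N / (df + 1)) idf formula are assumptions):

```python
import math

# Toy sketch of merging per-shard document frequencies into global idf
# values that every node could then cache. Shard contents are made up.
shard_dfs = [
    {"solr": 10, "lucene": 4},   # df per term on shard A
    {"solr": 7, "hadoop": 2},    # df per term on shard B
]
shard_doc_counts = [100, 50]

# Reduce step: sum document frequencies and document counts across shards.
global_df = {}
for dfs in shard_dfs:
    for term, df in dfs.items():
        global_df[term] = global_df.get(term, 0) + df
n_docs = sum(shard_doc_counts)

# Every node applies the same idf table, so scores agree across shards.
global_idf = {t: 1 + math.log(n_docs / (df + 1))
              for t, df in global_df.items()}
print(global_idf["solr"])
```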

Thoughts?

- Mitch

MitchK wrote:
> 
> Yonik,
> 
> why do we do not send the output of TermsComponent of every node in the
> cluster to a Hadoop instance?
> Since TermsComponent does the map-part of the map-reduce concept, Hadoop
> only needs to reduce the stuff. Maybe we even do not need Hadoop for this.
> After reducing, every node in the cluster gets the current values to
> compute the idf.
> We can store this information in a HashMap-based SolrCache (or something
> like that) to provide constant-time access. To keep the values up to date,
> we can repeat that after every x minutes.
> 
> If we got that, it does not care whereas we use doc_X from shard_A or
> shard_B, since they will all have got the same scores. 
> 
> Even if we got large indices with 10 million or more unique terms, this
> will only need some megabyte network-traffic.
> 
> Kind regards,
> - Mitch
> 
> 
> Yonik Seeley-2-2 wrote:
>> 
>> As the comments suggest, it's not a bug, but just the best we can do
>> for now since our priority queues don't support removal of arbitrary
>> elements.  I guess we could rebuild the current priority queue if we
>> detect a duplicate, but that will have an obvious performance impact.
>> Any other suggestions?
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990551.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocommit not happening

2010-07-23 Thread John DeRosa
On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:

> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
> happening in my Solr installation.
> 

[snip]

"Never mind"... I have discovered my boneheaded mistake. It's so silly, I wish 
I could retract my question from the archives.



Re: a bug of solr distributed search

2010-07-23 Thread Yonik Seeley
On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
> why do we do not send the output of TermsComponent of every node in the
> cluster to a Hadoop instance?
> Since TermsComponent does the map-part of the map-reduce concept, Hadoop
> only needs to reduce the stuff. Maybe we even do not need Hadoop for this.
> After reducing, every node in the cluster gets the current values to compute
> the idf.
> We can store this information in a HashMap-based SolrCache (or something
> like that) to provide constant-time access. To keep the values up to date,
> we can repeat that after every x minutes.

There's already a patch in JIRA that does distributed IDF.
Hadoop wouldn't be the right tool for that anyway... it's for batch
oriented systems, not low-latency queries.

> If we got that, it does not care whereas we use doc_X from shard_A or
> shard_B, since they will all have got the same scores.

That only works if the docs are exactly the same - they may not be.

-Yonik
http://www.lucidimagination.com


Re: help with a schema design problem

2010-07-23 Thread Pramod Goyal
I want to do that. But if i understand correctly in solr it would store the
field like this:

p_value: "Pramod"  "Raj"
p_type:  "Client" "Supplier"

When i search
p_value:"Pramod" AND p_type:"Supplier"

it would give me result as document 1. Which is incorrect, since in document
1 Pramod is a Client and not a Supplier.




On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin <
knagelb...@globeandmail.com> wrote:

> I think you just want something like:
>
> p_value:"Pramod" AND p_type:"Supplier"
>
> no?
> -Kallin Nagelberg
>
> -Original Message-
> From: Pramod Goyal [mailto:pramod.go...@gmail.com]
> Sent: Friday, July 23, 2010 2:17 PM
> To: solr-user@lucene.apache.org
> Subject: help with a schema design problem
>
> Hi,
>
> Lets say i have table with 3 columns document id Party Value and Party
> Type.
> In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
> Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type:
> Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier.
> Now in this table if i use SQL its easy for me find all document with Party
> Value as Pramod and Party Type as Client.
>
> I need to design solr schema so that i can do the same in Solr. If i create
> 2 fields in solr schema Party value and Party type both of them multi
> valued
> and try to query +Pramod +Supplier then solr will return me the first
> document, even though in the first document Pramod is a client and not a
> supplier
> Thanks,
> Pramod Goyal
>


Re: a bug of solr distributed search

2010-07-23 Thread MitchK

Yonik,

why do we do not send the output of TermsComponent of every node in the
cluster to a Hadoop instance?
Since TermsComponent does the map-part of the map-reduce concept, Hadoop
only needs to reduce the stuff. Maybe we even do not need Hadoop for this.
After reducing, every node in the cluster gets the current values to compute
the idf.
We can store this information in a HashMap-based SolrCache (or something
like that) to provide constant-time access. To keep the values up to date,
we can repeat that after every x minutes.

If we got that, it would not matter whether we use doc_X from shard_A or
shard_B, since they will all have the same scores.

Even for large indices with 10 million or more unique terms, this will only
need a few megabytes of network traffic.

Kind regards,
- Mitch


Yonik Seeley-2-2 wrote:
> 
> As the comments suggest, it's not a bug, but just the best we can do
> for now since our priority queues don't support removal of arbitrary
> elements.  I guess we could rebuild the current priority queue if we
> detect a duplicate, but that will have an obvious performance impact.
> Any other suggestions?
> 
> -Yonik
> http://www.lucidimagination.com
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p990506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sort by index order desc?

2010-07-23 Thread Ryan McKinley
Any pointers on how to sort by reverse index order?
http://search.lucidimagination.com/search/document/4a59ded3966271ca/sort_by_index_order_desc

it seems like it should be easy to do with the function query stuff,
but i'm not sure what to sort by (unless I add a new field for indexed
time)


Any pointers?

Thanks
Ryan


RE: help with a schema design problem

2010-07-23 Thread Nagelberg, Kallin
I think you just want something like:

p_value:"Pramod" AND p_type:"Supplier"

no?
-Kallin Nagelberg

-Original Message-
From: Pramod Goyal [mailto:pramod.go...@gmail.com] 
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem

Hi,

Lets say i have table with 3 columns document id Party Value and Party Type.
In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod
Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type:
Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier.
Now in this table if i use SQL its easy for me find all document with Party
Value as Pramod and Party Type as Client.

I need to design solr schema so that i can do the same in Solr. If i create
2 fields in solr schema Party value and Party type both of them multi valued
and try to query +Pramod +Supplier then solr will return me the first
document, even though in the first document Pramod is a client and not a
supplier
Thanks,
Pramod Goyal


help with a schema design problem

2010-07-23 Thread Pramod Goyal
Hi,

Let's say I have a table with 3 columns: document id, Party Value, and Party
Type. In this table I have 3 rows. 1st row: Document id: 1, Party Value:
Pramod, Party Type: Client. 2nd row: Document id: 1, Party Value: Raj, Party
Type: Supplier. 3rd row: Document id: 2, Party Value: Pramod, Party Type:
Supplier. Now in this table, if I use SQL, it is easy to find all documents
with Party Value as Pramod and Party Type as Client.

I need to design a Solr schema so that I can do the same in Solr. If I create
2 fields in the Solr schema, Party Value and Party Type, both of them
multi-valued, and try to query +Pramod +Supplier, then Solr will return the
first document, even though in the first document Pramod is a client and not
a supplier.
Thanks,
Pramod Goyal


Re: Replacing text fields with numeric fields for speed

2010-07-23 Thread Gora Mohanty
On Fri, 23 Jul 2010 14:33:54 +0200
Peter Karich  wrote:

> Gora,
> 
> just for my interests:
> does apache bench sends different queries, or from the logs, or
> always the same query?
> If it would be always the same query the cache of solr will come
> and make the response time super small.

Yes, the way that things are set up currently the query is always
the same. My reasoning was that the effect of the Solr cache should
be the same for both numeric, and text fields. I am going to be
trying some more rigorous tests, such as turning off Solr caching,
and pre-warming the query before running the tests.

> I would like to find a tool or script where I can send my logfile
> to solr and measure some things ... because at the moment we are
> using fastbench and I would like to replace it ;-)

Not sure what fastbench is, but using Solr logs as a tool to
measure search times for typical searches is an interesting idea.
Hmm, we will also need to do that, so maybe we can compare notes on
this.
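
As a starting point for replaying logs, something like this might do (a rough
sketch: the log-line format and the Solr URL are assumptions about a typical
Solr 1.4 request log):

```python
import re
import time
import urllib.request

# Rough sketch: pull the params out of Solr request-log lines and replay
# each query against a running Solr, timing the round trip. The log
# pattern and URL below are assumptions about a typical setup.
SOLR_URL = "http://localhost:8983/solr/select"
PARAM_RE = re.compile(r"params=\{([^}]*)\}")

def extract_params(log_line):
    """Return the raw params string from a request-log line, or None."""
    m = PARAM_RE.search(log_line)
    return m.group(1) if m else None

def replay(log_lines):
    for log_line in log_lines:
        params = extract_params(log_line)
        if params is None:
            continue
        start = time.time()
        urllib.request.urlopen(f"{SOLR_URL}?{params}").read()
        print(f"{time.time() - start:.3f}s  {params}")

line = "INFO: [] webapp=/solr path=/select params={q=ipod&rows=10} hits=3 status=0 QTime=2"
print(extract_params(line))  # q=ipod&rows=10
```

A real benchmark would also want warm-up runs and percentile reporting rather
than per-query prints.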

Regards,
Gora


RE: Spellcheck help

2010-07-23 Thread Dyer, James
In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):

final static String PATTERN = "(?:(?!(" + NMTOKEN + ":|\\d+)))[\\p{L}_\\-0-9]+";

and remove the |\\d+ to make it:

final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";

My testing shows this solves your problem.  The caution is to test it against 
all your use cases, because obviously someone thought we should ignore leading 
digits in keywords.  Surely there's a reason, although I can't think of it.
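
The effect of the change can be illustrated outside Solr (a hedged sketch:
Python's `re` stands in for Java's regex, and `[\w\-.]+` stands in for the
NMTOKEN expression, which is an assumption about its expansion):

```python
import re

# Before: the (?!...|\d+) lookahead refuses to start a token on a digit,
# so "3dsmax" is tokenized as "dsmax" -- the bug from the original post.
before = re.compile(r"(?:(?!(?:[\w\-.]+:|\d+)))[\w\-]+")
# After removing |\d+, the lookahead only skips field prefixes like "title:".
after = re.compile(r"(?:(?![\w\-.]+:))[\w\-]+")

print(before.findall("3dsmax"))        # ['dsmax']
print(after.findall("3dsmax"))         # ['3dsmax']
print(after.findall("title:3dsmax"))   # ['3dsmax']
```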

James Dyer
E-Commerce Systems
Ingram Book Company
(615) 213-4311

-Original Message-
From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] 
Sent: Saturday, July 17, 2010 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck help

Can anybody help me with this? :(

-Original Message- 
From: Marc Ghorayeb
Sent: Thursday, July 08, 2010 9:46 AM
To: solr-user@lucene.apache.org
Subject: Spellcheck help


Hello,I've been trying to get rid of a bug when using the spellcheck but so 
far with no success :(When searching for a word that starts with a number, 
for example "3dsmax", i get the results that i want, BUT the spellcheck says 
it is not correctly spelled AND the collation gives me "33dsmax". Further 
investigation shows that the spellcheck is actually only checking "dsmax" 
which it considers does not exist and gives me "3dsmax" for better results, 
but since i have spellcheck.collate = true, the collation that i show is 
"33dsmax" with the first 3 being the one discarded by the spellchecker... 
Otherwise, the spellcheck works correctly for normal words... any ideas? 
:(My spellcheck field is fairly classic, whitespace tokenizer, with 
lowercase filter...Any help would be greatly appreciated :)Thanks,Marc
_
Messenger arrive enfin sur iPhone ! Venez le télécharger gratuitement !
http://www.messengersurvotremobile.com/?d=iPhone 



RE: Novice seeking help to change filters to search without diacritics

2010-07-23 Thread Steven A Rowe
Hi HSingh,

Maybe the mapping file I attached to 
https://issues.apache.org/jira/browse/SOLR-2013 will help?

Steve

> -Original Message-
> From: HSingh [mailto:hsin...@gmail.com]
> Sent: Thursday, July 22, 2010 11:30 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Novice seeking help to change filters to search without
> diacritics
> 
> 
> Hoss, thank you for your helpful response!
> 
> : i think what's confusing you is that you are using the
> : MappingCharFilterFactory with that file in your "text" field type to
> : convert any ISOLatin1Accent characters to their "base" characters
> 
> The problem is that a large range of characters are not getting converting
> to their base characters.  The ASCIIFoldingFilterFactory handles this
> conversion for the entire Latin character set, including the extended sets
> without having to specify individual characters and their equivalent base
> characters.
> 
> Is there way for me to switch to ASCIIFoldingFilterFactory?  If so, what
> changes do I need to make to these files?  I would appreciate your help!
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Novice-
> seeking-help-to-change-filters-to-search-without-diacritics-
> tp971263p988890.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: filter query on timestamp slowing query???

2010-07-23 Thread Jonathan Rochkind

> and a typical query would be:
>
fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&
> rows=2000

On top of using trie dates, you might consider separating the timestamp portion 
and the type portion of the fq into separate fq parameters -- that will allow 
them to be stored in the filter cache separately. So for instance, if you 
include "type:x OR type:y" in queries a lot, but with different date ranges, 
then when you make a new query, the set for "type:x OR type:y" can be pulled 
from the filter cache and intersected with the other result set, and that 
portion won't have to be run again. That's probably not where your slowness is 
coming from, but it shouldn't hurt. 

Multiple fq's are essentially AND'd together, so whenever you have an fq that 
consists of separate clauses AND'd together, you can always separate them into 
multiple fq's; it won't affect the result set, but will affect the caching 
possibilities. 
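
Concretely, the split looks like this at the request level (a sketch; the
query strings are copied from the example above):

```python
from urllib.parse import urlencode

# One combined fq: cached as a single filter-cache entry, reusable only
# when the whole expression (date range included) repeats exactly.
combined = [
    ("q", '"Coca Cola" pepsi -"dr pepper"'),
    ("fq", "timestamp:[2010-07-07T00:00:00Z TO NOW] AND (type:x OR type:y)"),
]

# Split fq's: the type clause gets its own filter-cache entry and can be
# reused across queries with different date ranges.
split = [
    ("q", '"Coca Cola" pepsi -"dr pepper"'),
    ("fq", "timestamp:[2010-07-07T00:00:00Z TO NOW]"),
    ("fq", "type:x OR type:y"),
]

print(urlencode(split))
```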

Allow custom overrides

2010-07-23 Thread Charlie Jackson
I need to implement a search engine that will allow users to override
pieces of data and then search against or view that data. For example, a
doc that has the following values:

 

DocId   Fulltext             Meta1   Meta2   Meta3
1       The quick brown fox  foo     foo     foo

Now say a user overrides Meta2:

DocId   Fulltext             Meta1   Meta2   Meta3
1       The quick brown fox  foo     foo     foo
                                     bar

For that user, if they search for Meta2:bar, it needs to hit, but it should
not hit for any other user. Likewise, if that user searches for Meta2:foo, it
should not hit. Also, any searches against that document by that user should
return the value 'bar' for Meta2, but should return 'foo' for other users.

 

I'm not sure of the best way to implement this. Maybe I could do this with
field collapsing somehow? Or with payloads? A custom analyzer? Any help
would be appreciated.

 

 

- Charlie

 



Re: Autocommit not happening

2010-07-23 Thread John DeRosa
On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:

> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
> happening in my Solr installation.
> 
> My one server running Solr:
> 
> - Ubuntu 10.04 (Lucid Lynx), with all the latest updates.
> - Solr 1.4.0 running on Tomcat6
> - Installation was done via "apt-get install solr-common solr-tomcat 
> tomcat6-admin"
> 
> My solrconfig.xml has:
> 
>  1
>  1 
>
> 

[snip]

The plot thickens. var/log/tomcat6/catalina.out contains:

Jul 22, 2010 9:36:32 PM 
org.apache.solr.update.DirectUpdateHandler2$CommitTracker 
INFO: AutoCommit: disabled

What's stepping in and disabling autocommit?

John



Re: filter query on timestamp slowing query???

2010-07-23 Thread oferiko

I'm in the process of indexing my demo data to test that; I'll have more
valid data on whether or not it made the difference in a few days.
Thanks


On 23/07/2010 at 19:42, "Jonathan Rochkind [via Lucene]" <
ml-node+990234-2085494904-316...@n3.nabble.com> wrote:

> and a typical query would be:
>
fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&

> rows=2000

My understanding is that this is essentially what the solr 1.4 trie date
fields are made for, I'd use them, should speed things up.  Not sure where
the best documentation for them is, but see:

http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/




--
 View message @
http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p990234.html
To unsubscribe from Re: filter query on timestamp slowing query???, click
here< (link removed) =>.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p990337.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: filter query on timestamp slowing query???

2010-07-23 Thread Jonathan Rochkind
> and a typical query would be:
> fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&
> rows=2000

My understanding is that this is essentially what the solr 1.4 trie date fields 
are made for, I'd use them, should speed things up.  Not sure where the best 
documentation for them is, but see:

http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/




Autocommit not happening

2010-07-23 Thread John DeRosa
Hi! I'm a Solr newbie, and I don't understand why autocommits aren't happening 
in my Solr installation.

My one server running Solr:

- Ubuntu 10.04 (Lucid Lynx), with all the latest updates.
- Solr 1.4.0 running on Tomcat6
- Installation was done via "apt-get install solr-common solr-tomcat 
tomcat6-admin"

My solrconfig.xml has:
 
  1
  1 



My code can add documents just fine. But after 12 hours, autocommit has never 
happened! Here's what I see on my Solr Admin pages:

CORE:   
name:   core  
class:   
version:1.0  
description:SolrCore  
stats:  coreName : 
startTime : Thu Jul 22 21:38:30 UTC 2010 
refCount : 2 
aliases : [] 
name:   searcher  
class:  org.apache.solr.search.SolrIndexSearcher  
version:1.0  
description:index searcher  
stats:  searcherName : searc...@10ed7f5c main 
caching : true 
numDocs : 0 
maxDoc : 0 
reader : 
SolrIndexReader{this=509f662e,r=readonlydirectoryrea...@509f662e,refCnt=1,segments=0}
 
readerDir : org.apache.lucene.store.NIOFSDirectory@/var/lib/solr/data/index 
indexVersion : 1279834591965 
openedAt : Thu Jul 22 23:58:28 UTC 2010 
registeredAt : Thu Jul 22 23:58:28 UTC 2010 
warmupTime : 3 
name:   searc...@10ed7f5c main  
class:  org.apache.solr.search.SolrIndexSearcher  
version:1.0  
description:index searcher  
stats:  searcherName : searc...@10ed7f5c main 
caching : true 
numDocs : 0 
maxDoc : 0 
reader : 
SolrIndexReader{this=509f662e,r=readonlydirectoryrea...@509f662e,refCnt=1,segments=0}
 
readerDir : org.apache.lucene.store.NIOFSDirectory@/var/lib/solr/data/index 
indexVersion : 1279834591965 
openedAt : Thu Jul 22 23:58:28 UTC 2010 
registeredAt : Thu Jul 22 23:58:28 UTC 2010 
warmupTime : 3 


UPDATE HANDLERS:

name:   updateHandler  
class:  org.apache.solr.update.DirectUpdateHandler2  
version:1.0  
description:Update handler that efficiently directly updates the on-disk 
main lucene index  
stats:  commits : 2 
autocommits : 0 
optimizes : 0 
rollbacks : 0 
expungeDeletes : 0 
docsPending : 496590 
adds : 496590 
deletesById : 0 
deletesByQuery : 0 
errors : 0 
cumulative_adds : 501989 
cumulative_deletesById : 0 
cumulative_deletesByQuery : 2 
cumulative_errors : 0 


There's nearly 500K pending commits, accumulated over the past 12 hours. I 
think we're past the specified autocommit limits. :-)

What should I look at to figure out what's preventing autocommits?

Thank you all in advance!

John



Re: Solr on iPad?

2010-07-23 Thread Stephan Schwab

Thanks Mark!

I'm subscribing to the cocoa-dev list.

On Jul 23, 2010, at 10:17 AM, Mark Allan [via Lucene] wrote:

> Hi Stephan, 
> 
> On the iPad, as with the iPhone, I'm afraid you're stuck with using   
> SQLite if you want any form of database in your app. 
> 
> I suppose if you wanted to get really ambitious and had a lot of time   
> on your hands you could use Xcode to try and compile one of the open- 
> source C-based DBs/Indexers, but as with most things in OS X and iOS   
> development, if you're bending over yourself trying to implement   
> something, you're probably doing it wrongly!  Also, I wouldn't put it   
> past the AppStore guardians to reject your app purely on the basis of   
> having used something other than SQLite! 
> 
> Apple's cocoa-dev mailing list is very active if you have problems,   
> but do your homework before asking questions or you'll get short shrift. 
> http://lists.apple.com/cocoa-dev
> 
> Mark 
> 
> On 22 Jul 2010, at 6:12 pm, Stephan Schwab wrote: 
> 
> > Dear Solr community, 
> > 
> > does anyone know whether it may be possible or has already been done   
> > to 
> > bring Solr to the Apple iPad so that applications may use a local   
> > search 
> > engine? 
> > 
> > Greetings, 
> > Stephan
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in 
> Scotland, with registration number SC005336. 
> 
> 
> 
> View message @ 
> http://lucene.472066.n3.nabble.com/Solr-on-iPad-tp987655p989269.html 
> To unsubscribe from Solr on iPad?, click here.
> 


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-on-iPad-tp987655p990034.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
I mean two use cases.
I can't index folders only, because I have other queries on files. Or I would
have to build another index that contains only folders, but then I would have
to take care of synchronizing folders between the two indexes.
Are range, spatial, etc. queries supported on multivalued fields?
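
For the "one file per folder" case, the postprocessing route Peter mentions
below could be sketched client-side like this (document values follow the
example table; assume the result list is already relevance-ordered):

```python
# Sketch: collapse a fetched Solr result list to one file per folder on
# the client side. Docs mirror the example table in this thread.
results = [
    {"id": 1, "type": "file", "folderId": 0},
    {"id": 3, "type": "file", "folderId": 0},
    {"id": 9, "type": "file", "folderId": 8},
    {"id": 11, "type": "file", "folderId": 8},
]

def one_file_per_folder(docs):
    """Keep only the first (best-ranked) file seen for each folder."""
    seen = set()
    kept = []
    for doc in docs:
        if doc["folderId"] not in seen:
            seen.add(doc["folderId"])
            kept.append(doc)
    return kept

print([d["id"] for d in one_file_per_folder(results)])  # [1, 9]
```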

2010/7/23 Peter Karich 

> Pavel,
>
> hopefully I understand now your usecase :-) but one question:
>
> > I need to select always *one* file per folder or
> > select *only* folders than contains matched files (without files).
>
> What do you mean here with 'or'? Do you have 2 usecases or would one of
> them be sufficient?
> Because the second usecase could be solved without the patch: you could
> index folders only,
> then all prop_N will be multivalued field. and you don't have the problem
> of duplicate folders.
>
> (If you don't mind uglyness both usecases could even handled: After you got
> the folders
>  grabbing the files which matched could be done in postprocessing)
>
> But I fear the cleanest solution is to use the patch. Hopefully it can be
> applied without hassles
> against 1.4 or the trunk. If not, please ask on the patch-site for
> assistance.
>
> Regards,
> Peter.
>
>
> > Thanks, Peter!
> >
> > I'll try collapsing today.
> >
> > Example (sorry if table unformated):
> >
> > id |  type  |   prop_1  |  |  prop_N |  folderId
> > 
> >  0 | folder |   |  | |
> >  1 | file   |  val1 |  |  valN1  |   0
> >  2 | file   |  val3 |  |  valN2  |   0
> >  3 | file   |  val1 |  |  valN3  |   0
> >  4 | folder |   |  | |
> >  5 | folder |   |  | |
> >  6 | file   |  val3 |  |  valN7  |   6
> >  7 | file   |  val4 |  |  valN8  |   6
> >  8 | folder |   |  | |
> >  9 | file   |  val2 |  |  valN3  |   8
> >  10| file   |  val1 |  |  valN2  |   8
> >  11| file   |  val2 |  |  valN5  |   8
> >  12| folder |   |  | |
> >
> >
> > I need to select always *one* file per folder or
> > select *only* folders than contains matched files (without files).
> >
> > Query:
> > prop_1:val1 OR prop_2:val2
> >
> > I need results (document ids):
> > 1, 9
> > or
> > 0, 8
> >
> > 2010/7/23 Peter Karich 
> >
> >
> >> Hi Pavel!
> >>
> >> The patch can be applied to 1.4.
> >> The performance is ok, but for some situations it could be worse than
> >> without the patch.
> >> For us it works good, but others reported some exceptions
> >> (see the patch site: https://issues.apache.org/jira/browse/SOLR-236)
> >>
> >>
> >>> I need only to delete duplicates
> >>>
> >> Could you give us an example what you exactly need?
> >> (Maybe you could index each master document of the 'unique' documents
> >> with an extra field and query for that field?)
> >>
> >> Regards,
> >> Peter.
> >>
> >> --
> >>
> > Pavel Minchenkov
> >
> >
>
>
> --
> http://karussell.wordpress.com/
>
>


-- 
Pavel Minchenkov


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Eric Grobler
Hi Erik,

I must be doing something wrong :-(
I took:
svn co https://svn.apache.org/repos/asf/lucene/dev/trunk  mytest
  then i copied SOLR-792.path to folder /mytest/solr
then i ran:
  patch -p1 < SOLR-792.patch

but I get "can't find file to patch at input line 5"
Is this the correct trunk and patch command?

However if I just manually
  - copy TreeFacetComponent.java to folder
solr/src/java/org/apache/solr/handler/component
  - add SimpleOrderedMap _treeFacets; to
ResponseBuilder.java
  - and make the changes to solrconfig.xml
I am able to compile and run your test :-)

Regards
Eric


On Fri, Jul 23, 2010 at 2:37 PM, Erik Hatcher wrote:

> I've updated the SOLR-792 patch to apply to trunk (using the solr/ directory
> as the root still, not the higher-level trunk/).
>
> This one I think is an important one that I'd love to see eventually part
> of Solr built-in, but the TODO's in TreeFacetComponent ought to be taken
> care of first, to generalize this to N fields levels and maybe some other
> must/nice-to-haves.
>
>Erik
>
>
>
> On Jul 23, 2010, at 3:45 AM, Eric Grobler wrote:
>
>  Thanks I saw the article,
>>
>> As far as I can tell the trunk archives only go back to the middle of
>> March
>> and the 2 patches are from the beginning of the year.
>>
>> Thus:
>> "These approaches can be tried out easily using a single set of sample
>> data
>> and the Solr example application (assumes current trunk codebase and
>> latest
>> patches posted to the respective issues)."
>>
>> is a bit of an over-statement!
>>
>> Regards
>> Eric
>> On Fri, Jul 23, 2010 at 6:22 AM, Jonathan Rochkind 
>> wrote:
>>
>>  Solr does not, yet, at least not simply, as far as I know, but there are
>>> ideas and some JIRA's with maybe some patches:
>>>
>>> http://wiki.apache.org/solr/HierarchicalFaceting
>>>
>>>
>>> 
>>> From: rajini maski [rajinima...@gmail.com]
>>> Sent: Friday, July 23, 2010 12:34 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Tree Faceting in Solr 1.4
>>>
>>> I am also looking out for same feature in Solr and very keen to know
>>> whether
>>> it supports this feature of tree faceting... Or we are forced to index in
>>> tree faceting format, like
>>>
>>> 1/2/3/4
>>> 1/2/3
>>> 1/2
>>> 1
>>>
>>> In-case of multilevel faceting it will give only 2 level tree facet is
>>> what
>>> i found..
>>>
>>> If i give query as : country India and state Karnataka and city
>>> bangalore...All what i want is a facet count  1) for condition above. 2)
>>> The
>>> number of states in that Country 3) the number of cities in that state
>>> ...
>>>
>>> Like => Country: India ,State:Karnataka , City: Bangalore <1>
>>>
>>>   State:Karnataka
>>>Kerla
>>>Tamilnadu
>>>Andra Pradesh...and so on
>>>
>>>   City:  Mysore
>>>Hubli
>>>Mangalore
>>>Coorg and so on...
>>>
>>>
>>> If I am doing
>>> facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka
>>>
>>> All it gives me is Facets on state excluding only that filter query.. But
>>> i
>>> was not able to do same on third level ..Like  facet.field= Give me the
>>> counts of  cities also in state Karantaka..
>>> Let me know solution for this...
>>>
>>> Regards,
>>> Rajani Maski
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler <
>>> impalah...@googlemail.com
>>>
 wrote:

>>>
>>>  Thank you for the link.

 I was not aware of the multifaceting syntax - this will enable me to run

>>> 1
>>>
 less query on the main page!

 However this is not a tree faceting feature.

 Thanks
 Eric




 On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:

  Perhaps the following article can help:
>
>

>>> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
>>>

> -S
>
>
> On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
>
>  Hi Solr Community
>>
>> If I have:
>> COUNTRY CITY
>> Germany Berlin
>> Germany Hamburg
>> Spain   Madrid
>>
>> Can I do faceting like:
>> Germany
>> Berlin
>> Hamburg
>> Spain
>> Madrid
>>
>> I tried to apply SOLR-792 to the current trunk but it does not seem
>>
> to
>>>
 be

> compatible.
>> Maybe there is a similar feature existing in the latest builds?
>>
>> Thanks & Regards
>> Eric
>>
>
>
>

>>>
>


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Geert-Jan Brits
>If I am doing
>facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka

>All it gives me is Facets on state excluding only that filter query.. But i
>was not able to do same on third level ..Like  facet.field= Give me the
>counts of  cities also in state Karantaka..
>Let me know solution for this...

This looks like regular faceting to me.

1. Showing city counts given a state:
facet=on&fq=State:Karnataka&facet.field=city

2. Showing state counts given a country (similar to 1):
facet=on&fq=Country:India&facet.field=state

3. Showing city and state counts given a country:
facet=on&fq=Country:India&facet.field=state&facet.field=city

4. Showing city counts given a state, plus counts for all other states not
filtered by the current state (see
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters):
facet=on&fq={!tag=State}state:Karnataka&facet.field={!ex=State}state&facet.field=city

5. Showing state and city counts given a country, plus counts for all other
countries not filtered by the current country (similar to 4):
facet=on&fq={!tag=country}country:India&facet.field={!ex=country}country&facet.field=city&facet.field=state

etc.
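When assembling parameters like those in example 4 from client code, the local-params syntax (`{!tag=...}` / `{!ex=...}`) is easy to mangle with URL escaping. A small helper can build the query string safely. This is a sketch in Python; the host, core path, and field names are assumptions, not from the thread:

```python
from urllib.parse import urlencode

def tagged_facet_query(base_url, filter_field, filter_value, facet_fields):
    """Build a Solr select URL that tags the filter query and excludes
    that tag when faceting on the filtered field (multi-select faceting)."""
    params = [
        ("q", "*:*"),
        ("facet", "on"),
        # Tag the filter so a facet.field below can exclude it.
        ("fq", "{!tag=%s}%s:%s" % (filter_field, filter_field, filter_value)),
    ]
    for field in facet_fields:
        if field == filter_field:
            # Exclude the tagged filter: counts for *all* values of this field.
            params.append(("facet.field", "{!ex=%s}%s" % (filter_field, field)))
        else:
            params.append(("facet.field", field))
    return base_url + "/select?" + urlencode(params)

url = tagged_facet_query("http://localhost:8983/solr", "state", "Karnataka",
                         ["state", "city"])
print(url)
```

Letting `urlencode` handle the escaping avoids hand-encoding `{`, `!`, and `}` in the local params.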

This has nothing to do with "Hierarchical faceting" as described in SOLR-792,
btw, although I understand the possible confusion, as Country > state > city
can obviously be seen as some sort of hierarchy. The first part of your
question seemed to be more about hierarchical faceting as per SOLR-792, but I
couldn't quite distill a question from that part.
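If you do go the path-encoding route rajini mentions (indexing 1/2/3/4, 1/2/3, 1/2, 1 per document), the per-level tokens can be generated at index time. A minimal sketch, assuming '/'-separated paths; the example values are illustrative only:

```python
def path_prefixes(path, sep="/"):
    """Turn 'India/Karnataka/Bangalore' into one token per hierarchy level,
    so a filter or facet can match any prefix of the path."""
    parts = path.split(sep)
    return [sep.join(parts[:i + 1]) for i in range(len(parts))]

tokens = path_prefixes("India/Karnataka/Bangalore")
print(tokens)  # ['India', 'India/Karnataka', 'India/Karnataka/Bangalore']
```

Indexing all prefixes into a multivalued field lets you facet at any depth by filtering on the parent prefix; newer Lucene/Solr versions ship a PathHierarchyTokenizer that does essentially this during analysis.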

Also, just a suggestion, consider using IDs instead of names for filtering;
you will get burned sooner or later otherwise.

HTH,

Geert-Jan



2010/7/23 rajini maski 

> I am also looking out for same feature in Solr and very keen to know
> whether
> it supports this feature of tree faceting... Or we are forced to index in
> tree faceting formatlike
>
> 1/2/3/4
> 1/2/3
> 1/2
> 1
>
> In-case of multilevel faceting it will give only 2 level tree facet is what
> i found..
>
> If i give query as : country India and state Karnataka and city
> bangalore...All what i want is a facet count  1) for condition above. 2)
> The
> number of states in that Country 3) the number of cities in that state ...
>
> Like => Country: India ,State:Karnataka , City: Bangalore <1>
>
> State:Karnataka
>  Kerla
>  Tamilnadu
>  Andra Pradesh...and so on
>
> City:  Mysore
>  Hubli
>  Mangalore
>  Coorg and so on...
>
>
> If I am doing
> facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka
>
> All it gives me is Facets on state excluding only that filter query.. But i
> was not able to do same on third level ..Like  facet.field= Give me the
> counts of  cities also in state Karantaka..
> Let me know solution for this...
>
> Regards,
> Rajani Maski
>
>
>
>
>
> On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler  >wrote:
>
> > Thank you for the link.
> >
> > I was not aware of the multifaceting syntax - this will enable me to run
> 1
> > less query on the main page!
> >
> > However this is not a tree faceting feature.
> >
> > Thanks
> > Eric
> >
> >
> >
> >
> > On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:
> >
> > > Perhaps the following article can help:
> > >
> >
> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
> > >
> > > -S
> > >
> > >
> > > On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
> > >
> > > > Hi Solr Community
> > > >
> > > > If I have:
> > > > COUNTRY CITY
> > > > Germany Berlin
> > > > Germany Hamburg
> > > > Spain   Madrid
> > > >
> > > > Can I do faceting like:
> > > > Germany
> > > >  Berlin
> > > >  Hamburg
> > > > Spain
> > > >  Madrid
> > > >
> > > > I tried to apply SOLR-792 to the current trunk but it does not seem
> to
> > be
> > > > compatible.
> > > > Maybe there is a similar feature existing in the latest builds?
> > > >
> > > > Thanks & Regards
> > > > Eric
> > >
> > >
> >
>


solrj occasional timeout on commit

2010-07-23 Thread Nagelberg, Kallin
Hey,

I recently moved a solr app from a testing environment into a production 
environment, and I'm seeing a brand new error which never occurred during 
testing. I'm seeing this in the solrJ-based app logs:


org.apache.solr.common.SolrException: com.caucho.vfs.SocketTimeoutException: 
client timeout

com.caucho.vfs.SocketTimeoutException: client timeout

request: http://somehost:8080/solr/live/update?wt=javabin&version=1

at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)

at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)

at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)




This occurs in a service that periodically adds new documents to solr. There 
are 4 boxes that could be doing updates in parallel. In testing there were 2.





We're running on a new Resin 4 based install in production, whereas we were 
using Resin 3 in testing. Does anyone have any ideas? Help would be greatly 
appreciated!



Thanks,

-Kallin Nagelberg
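[Editor's note: two things worth trying while the root cause is hunted down. SolrJ's CommonsHttpSolrServer exposes setSoTimeout()/setConnectionTimeout() to raise the client-side timeouts directly. Independently of the client library, a timed-out update can be retried with bounded backoff; a language-agnostic sketch of that retry logic, here in Python with a fake commit standing in for the real call:]

```python
import time

def retry_with_backoff(op, attempts=3, base_delay=1.0, retryable=(TimeoutError,)):
    """Run op(); on a retryable failure wait, double the delay, and try again.
    Re-raises the last error if every attempt fails."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except retryable:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2

# Fake commit that times out twice, then succeeds (stands in for the real update).
calls = {"n": 0}
def flaky_commit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("client timeout")
    return "committed"

print(retry_with_backoff(flaky_commit, attempts=4, base_delay=0.01))  # committed
```

If the four writers' commits overlap, the server can spend long stretches warming new searchers, which is a common cause of slow or timed-out updates; consolidating commits (or relying on autoCommit) is also worth testing.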







Re: Solr 3.1 dev

2010-07-23 Thread Yonik Seeley
On Fri, Jul 23, 2010 at 9:33 AM, robert mena  wrote:
> Hi,
> is there any wiki/url of the proposed changes or new features that we should
> expect with this new release?

You can see what has already gone in by looking at the appropriate
CHANGES.txt in subversion.

http://svn.apache.org/viewvc/lucene/dev/trunk/solr/CHANGES.txt?view=markup
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/CHANGES.txt?view=markup

-Yonik
http://www.lucidimagination.com


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Eric Grobler
Hi Erik,

Thanks for the fast update :-)
I will try it soon.

Regards
Eric

On Fri, Jul 23, 2010 at 2:37 PM, Erik Hatcher wrote:

> I've updated the SOLR-792 patch to apply to trunk (using the solr/ directory
> as the root still, not the higher-level trunk/).
>
> This one I think is an important one that I'd love to see eventually part
> of Solr built-in, but the TODO's in TreeFacetComponent ought to be taken
> care of first, to generalize this to N fields levels and maybe some other
> must/nice-to-haves.
>
>Erik
>
>
>
> On Jul 23, 2010, at 3:45 AM, Eric Grobler wrote:
>
>  Thanks I saw the article,
>>
>> As far as I can tell the trunk archives only go back to the middle of
>> March
>> and the 2 patches are from the beginning of the year.
>>
>> Thus:
>> "These approaches can be tried out easily using a single set of sample
>> data
>> and the Solr example application (assumes current trunk codebase and
>> latest
>> patches posted to the respective issues)."
>>
>> is a bit of an over-statement!
>>
>> Regards
>> Eric
>> On Fri, Jul 23, 2010 at 6:22 AM, Jonathan Rochkind 
>> wrote:
>>
>>  Solr does not, yet, at least not simply, as far as I know, but there are
>>> ideas and some JIRA's with maybe some patches:
>>>
>>> http://wiki.apache.org/solr/HierarchicalFaceting
>>>
>>>
>>> 
>>> From: rajini maski [rajinima...@gmail.com]
>>> Sent: Friday, July 23, 2010 12:34 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Tree Faceting in Solr 1.4
>>>
>>> I am also looking out for same feature in Solr and very keen to know
>>> whether
>>> it supports this feature of tree faceting... Or we are forced to index in
>>> tree faceting format, like
>>>
>>> 1/2/3/4
>>> 1/2/3
>>> 1/2
>>> 1
>>>
>>> In-case of multilevel faceting it will give only 2 level tree facet is
>>> what
>>> i found..
>>>
>>> If i give query as : country India and state Karnataka and city
>>> bangalore...All what i want is a facet count  1) for condition above. 2)
>>> The
>>> number of states in that Country 3) the number of cities in that state
>>> ...
>>>
>>> Like => Country: India ,State:Karnataka , City: Bangalore <1>
>>>
>>>   State:Karnataka
>>>Kerla
>>>Tamilnadu
>>>Andra Pradesh...and so on
>>>
>>>   City:  Mysore
>>>Hubli
>>>Mangalore
>>>Coorg and so on...
>>>
>>>
>>> If I am doing
>>> facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka
>>>
>>> All it gives me is Facets on state excluding only that filter query.. But
>>> i
>>> was not able to do same on third level ..Like  facet.field= Give me the
>>> counts of  cities also in state Karantaka..
>>> Let me know solution for this...
>>>
>>> Regards,
>>> Rajani Maski
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler <
>>> impalah...@googlemail.com
>>>
 wrote:

>>>
>>>  Thank you for the link.

 I was not aware of the multifaceting syntax - this will enable me to run

>>> 1
>>>
 less query on the main page!

 However this is not a tree faceting feature.

 Thanks
 Eric




 On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:

  Perhaps the following article can help:
>
>

>>> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
>>>

> -S
>
>
> On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
>
>  Hi Solr Community
>>
>> If I have:
>> COUNTRY CITY
>> Germany Berlin
>> Germany Hamburg
>> Spain   Madrid
>>
>> Can I do faceting like:
>> Germany
>> Berlin
>> Hamburg
>> Spain
>> Madrid
>>
>> I tried to apply SOLR-792 to the current trunk but it does not seem
>>
> to
>>>
 be

> compatible.
>> Maybe there is a similar feature existing in the latest builds?
>>
>> Thanks & Regards
>> Eric
>>
>
>
>

>>>
>


Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Erik Hatcher
I've updated the SOLR-792 patch to apply to trunk (using the solr/  
directory as the root still, not the higher-level trunk/).


This one I think is an important one that I'd love to see eventually  
part of Solr built-in, but the TODO's in TreeFacetComponent ought to  
be taken care of first, to generalize this to N fields levels and  
maybe some other must/nice-to-haves.


Erik


On Jul 23, 2010, at 3:45 AM, Eric Grobler wrote:


Thanks I saw the article,

As far as I can tell the trunk archives only go back to the middle  
of March

and the 2 patches are from the beginning of the year.

Thus:
"These approaches can be tried out easily using a single set of
sample data
and the Solr example application (assumes current trunk codebase and
latest
patches posted to the respective issues)."

is a bit of an over-statement!

Regards
Eric
On Fri, Jul 23, 2010 at 6:22 AM, Jonathan Rochkind  
 wrote:


Solr does not, yet, at least not simply, as far as I know, but  
there are

ideas and some JIRA's with maybe some patches:

http://wiki.apache.org/solr/HierarchicalFaceting



From: rajini maski [rajinima...@gmail.com]
Sent: Friday, July 23, 2010 12:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Tree Faceting in Solr 1.4

I am also looking out for same feature in Solr and very keen to know
whether
it supports this feature of tree faceting... Or we are forced to  
index in

tree faceting format, like

1/2/3/4
1/2/3
1/2
1

In-case of multilevel faceting it will give only 2 level tree facet  
is what

i found..

If i give query as : country India and state Karnataka and city
bangalore...All what i want is a facet count  1) for condition  
above. 2)

The
number of states in that Country 3) the number of cities in that  
state ...


Like => Country: India ,State:Karnataka , City: Bangalore <1>

   State:Karnataka
Kerla
Tamilnadu
Andra Pradesh...and so on

   City:  Mysore
Hubli
Mangalore
Coorg and so on...


If I am doing
facet=on & facet.field={!ex=State}State & fq={! 
tag=State}State:Karnataka


All it gives me is Facets on state excluding only that filter  
query.. But i
was not able to do same on third level ..Like  facet.field= Give me  
the

counts of  cities also in state Karantaka..
Let me know solution for this...

Regards,
Rajani Maski





On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler 
wrote:



Thank you for the link.

I was not aware of the multifaceting syntax - this will enable me  
to run

1

less query on the main page!

However this is not a tree faceting feature.

Thanks
Eric




On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:


Perhaps the following article can help:




http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html


-S


On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:


Hi Solr Community

If I have:
COUNTRY CITY
Germany Berlin
Germany Hamburg
Spain   Madrid

Can I do faceting like:
Germany
Berlin
Hamburg
Spain
Madrid

I tried to apply SOLR-792 to the current trunk but it does not  
seem

to

be

compatible.
Maybe there is a similar feature existing in the latest builds?

Thanks & Regards
Eric











Re: Solr 3.1 dev

2010-07-23 Thread robert mena
Hi,

is there any wiki/url of the proposed changes or new features that we should
expect with this new release?

On Fri, Jul 23, 2010 at 9:20 AM, Yonik Seeley wrote:

> On Fri, Jul 23, 2010 at 6:09 AM, Eric Grobler 
> wrote:
> > I have a few questions :-)
> >
> > a) Will the next release of solr be 3.0 (instead of 1.5)?
>
> The next release will be 3.1 (matching the next lucene version off of
> the 3x branch).
> Trunk is 4.0-dev
>
> > b) How stable/mature is the current 3x version?
>
> For features that are not new, it should be very stable.
>
> > c) Is LocalSolr implemented? where can I find a list of new features?
>
> Solr spatial is partly implemented... currently in trunk.
> http://wiki.apache.org/solr/SpatialSearch
>
> > d) Is this the correct method to download the latest stable version?
> > svn co https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
>
> The last official Solr release was 1.4.1
> Nightly builds aren't official apache releases... but plenty of people
> do use them in production environments (after appropriate testing of
> course).
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Solr 3.1 dev

2010-07-23 Thread Yonik Seeley
On Fri, Jul 23, 2010 at 6:09 AM, Eric Grobler  wrote:
> I have a few questions :-)
>
> a) Will the next release of solr be 3.0 (instead of 1.5)?

The next release will be 3.1 (matching the next lucene version off of
the 3x branch).
Trunk is 4.0-dev

> b) How stable/mature is the current 3x version?

For features that are not new, it should be very stable.

> c) Is LocalSolr implemented? where can I find a list of new features?

Solr spatial is partly implemented... currently in trunk.
http://wiki.apache.org/solr/SpatialSearch

> d) Is this the correct method to download the latest stable version?
> svn co https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x

The last official Solr release was 1.4.1
Nightly builds aren't official apache releases... but plenty of people
do use them in production environments (after appropriate testing of
course).

-Yonik
http://www.lucidimagination.com


Re: Replacing text fields with numeric fields for speed

2010-07-23 Thread Peter Karich
Gora,

just out of interest:
does Apache Bench send different queries (e.g. taken from the logs), or always
the same query?
If it is always the same query, Solr's caches will kick in and
make the response times look artificially small.

I would like to find a tool or script that can replay my log file against Solr
and measure response times ... because at the moment we are using fastbench
and I would like to replace it ;-)

Regards,
Peter.
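[Editor's note: a minimal log-replay harness is easy to sketch: read one query string per line from a file, issue each against Solr once (so each distinct query hits a cold cache at most once, unlike ab repeating one query), and report summary latencies. The log format and Solr URL below are assumptions:]

```python
import time
import urllib.parse
import urllib.request
from statistics import median

def load_queries(lines):
    """One raw query string per line; blank lines and #-comments are skipped."""
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]

def summarize(latencies):
    """Count, median, and worst-case latency for a list of timings (seconds)."""
    return {"n": len(latencies), "median": median(latencies), "max": max(latencies)}

def replay(queries, base_url="http://localhost:8983/solr/select?q="):
    """Issue each query once against Solr and time the round trip."""
    timings = []
    for q in queries:
        start = time.time()
        urllib.request.urlopen(base_url + urllib.parse.quote(q)).read()
        timings.append(time.time() - start)
    return timings

qs = load_queries(["ipod", "", "# comment", "video"])
print(qs)                               # ['ipod', 'video']
print(summarize([0.02, 0.05, 0.03]))    # {'n': 3, 'median': 0.03, 'max': 0.05}
```

Reporting the median rather than the mean keeps one slow GC pause or cold-cache query from dominating the result.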

> On Fri, 23 Jul 2010 14:44:32 +0530
> Gora Mohanty  wrote:
> [...]
>   
>>   From some experiments, I see only a small difference between a
>> text search on a field, and a numeric search on the corresponding
>> numeric field.
>> 
> [...]
>
> Well, I take that back. Running more rigorous tests with Apache
> Bench shows a difference of slightly over a factor of 2 between the
> median search time on the numeric field, and on the text field. The
> search on the numeric field is, of course, faster. That much
> of a difference puzzles me. Would someone knowledgeable about
> Lucene indexes care to comment?
>
> Regards,
> Gora
>   


Re: Duplicates

2010-07-23 Thread Peter Karich
Pavel,

hopefully I now understand your use case :-) but one question:

> I always need to select *one* file per folder, or
> select *only* the folders that contain matched files (without the files).

What do you mean by 'or' here? Do you have two use cases, or would one of them
be sufficient?
The second use case could be solved without the patch: you could index folders
only;
then each prop_N becomes a multivalued field, and you don't have the problem of
duplicate folders.

(If you don't mind ugliness, both use cases could even be handled: after you
get the folders,
grabbing the files which matched could be done in post-processing.)

But I fear the cleanest solution is to use the patch. Hopefully it can be
applied without hassle
against 1.4 or the trunk. If not, please ask on the patch page for assistance.

Regards,
Peter.


> Thanks, Peter!
>
> I'll try collapsing today.
>
> Example (sorry if table unformatted):
>
> id |  type  |   prop_1  |  |  prop_N |  folderId
> 
>  0 | folder |   |  | |
>  1 | file   |  val1 |  |  valN1  |   0
>  2 | file   |  val3 |  |  valN2  |   0
>  3 | file   |  val1 |  |  valN3  |   0
>  4 | folder |   |  | |
>  5 | folder |   |  | |
>  6 | file   |  val3 |  |  valN7  |   6
>  7 | file   |  val4 |  |  valN8  |   6
>  8 | folder |   |  | |
>  9 | file   |  val2 |  |  valN3  |   8
>  10| file   |  val1 |  |  valN2  |   8
>  11| file   |  val2 |  |  valN5  |   8
>  12| folder |   |  | |
>
>
> I always need to select *one* file per folder, or
> select *only* the folders that contain matched files (without the files).
>
> Query:
> prop_1:val1 OR prop_2:val2
>
> I need results (document ids):
> 1, 9
> or
> 0, 8
>
> 2010/7/23 Peter Karich 
>
>   
>> Hi Pavel!
>>
>> The patch can be applied to 1.4.
>> The performance is ok, but for some situations it could be worse than
>> without the patch.
>> For us it works good, but others reported some exceptions
>> (see the patch site: https://issues.apache.org/jira/browse/SOLR-236)
>>
>> 
>>> I need only to delete duplicates
>>>   
>> Could you give us an example what you exactly need?
>> (Maybe you could index each master document of the 'unique' documents
>> with an extra field and query for that field?)
>>
>> Regards,
>> Peter.
>>
>> --
>> 
> Pavel Minchenkov
>
>   


-- 
http://karussell.wordpress.com/
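[Editor's note: the post-processing Peter mentions is straightforward for the data in Pavel's table: walk the matched file documents in ranked order and keep the first one seen per folderId. A sketch; the document shape is assumed from the table, and the rank order of the matches is illustrative:]

```python
def one_file_per_folder(docs):
    """Keep the first (i.e. best-ranked) file per folderId; docs is assumed
    to be the already-ranked Solr result list."""
    seen = set()
    kept = []
    for doc in docs:
        fid = doc["folderId"]
        if fid not in seen:
            seen.add(fid)
            kept.append(doc)
    return kept

# Matches for prop_1:val1 OR prop_2:val2 from the example table, in rank order:
matches = [
    {"id": 1, "folderId": 0},
    {"id": 3, "folderId": 0},
    {"id": 9, "folderId": 8},
    {"id": 10, "folderId": 8},
    {"id": 11, "folderId": 8},
]
print([d["id"] for d in one_file_per_folder(matches)])  # [1, 9]
```

This reproduces the "1, 9" result Pavel asked for, but note the usual caveat with client-side collapsing: paging and total counts are computed before the dedup, so it only works cleanly when you can fetch all matches.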



Re: filter query on timestamp slowing query???

2010-07-23 Thread oferiko

I don't specify any sort order, and I do request the score, so results are
ordered by that.

My schema consists of these fields:
 
 (changing now to tdate)
 


and a typical query would be:
fl=id,type,timestamp,score&start=0&q="Coca+Cola"+pepsi+-"dr+pepper"&fq=timestamp:[2010-07-07T00:00:00Z+TO+NOW]+AND+(type:x+OR+type:y)&rows=2000

thanks again for your time
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-query-on-timestamp-slowing-query-tp977280p989536.html
Sent from the Solr - User mailing list archive at Nabble.com.
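[Editor's note: one thing worth checking with that fq: an upper bound of NOW produces a textually different filter on every request, so Solr's filter cache can never reuse it. Rounding with Solr date math (NOW/DAY+1DAY, or NOW/HOUR for finer granularity) makes the fq repeatable and cacheable. A sketch of building such a filter string; the field name is taken from the message, the rounding choice is an assumption:]

```python
def cache_friendly_range_fq(field, start, granularity="DAY"):
    """Range filter whose upper bound is rounded with Solr date math, so
    repeated requests produce an identical fq that can hit the filter cache."""
    upper = "NOW/%s+1%s" % (granularity, granularity)
    return "%s:[%s TO %s]" % (field, start, upper)

fq = cache_friendly_range_fq("timestamp", "2010-07-07T00:00:00Z")
print(fq)  # timestamp:[2010-07-07T00:00:00Z TO NOW/DAY+1DAY]
```

Combined with the switch to tdate already planned, this often removes most of the cost of a timestamp filter.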


Re: Replacing text fields with numeric fields for speed

2010-07-23 Thread Gora Mohanty
On Fri, 23 Jul 2010 14:44:32 +0530
Gora Mohanty  wrote:
[...]
>   From some experiments, I see only a small difference between a
> text search on a field, and a numeric search on the corresponding
> numeric field.
[...]

Well, I take that back. Running more rigorous tests with Apache
Bench shows a difference of slightly over a factor of 2 between the
median search time on the numeric field, and on the text field. The
search on the numeric field is, of course, faster. That much
of a difference puzzles me. Would someone knowledgeable about
Lucene indexes care to comment?

Regards,
Gora


Re: Delta import processing duration

2010-07-23 Thread Qwerky

I found my problem! It was a bad custom EntityProcessor I wrote.

My EntityProcessor wasn't checking for hasNext() on the Iterator from my
FileImportDataImportHandler, it was just returning next(). The second bug
was that when the Iterator ran out of records it was returning an empty
Map (it now returns null).
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-import-processing-duration-tp987562p989425.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 3.1 dev

2010-07-23 Thread Eric Grobler
Hi Everyone

I have a few questions :-)

a) Will the next release of solr be 3.0 (instead of 1.5)?

b) How stable/mature is the current 3x version?

c) Is LocalSolr implemented? where can I find a list of new features?

d) Is this the correct method to download the latest stable version?
svn co https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x

Thanks & regards
eric


Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
Thanks, Peter!

I'll try collapsing today.

Example (sorry if table unformatted):

id |  type  |   prop_1  |  |  prop_N |  folderId

 0 | folder |   |  | |
 1 | file   |  val1 |  |  valN1  |   0
 2 | file   |  val3 |  |  valN2  |   0
 3 | file   |  val1 |  |  valN3  |   0
 4 | folder |   |  | |
 5 | folder |   |  | |
 6 | file   |  val3 |  |  valN7  |   6
 7 | file   |  val4 |  |  valN8  |   6
 8 | folder |   |  | |
 9 | file   |  val2 |  |  valN3  |   8
 10| file   |  val1 |  |  valN2  |   8
 11| file   |  val2 |  |  valN5  |   8
 12| folder |   |  | |


I always need to select *one* file per folder, or
select *only* the folders that contain matched files (without the files).

Query:
prop_1:val1 OR prop_2:val2

I need results (document ids):
1, 9
or
0, 8

2010/7/23 Peter Karich 

> Hi Pavel!
>
> The patch can be applied to 1.4.
> The performance is ok, but for some situations it could be worse than
> without the patch.
> For us it works good, but others reported some exceptions
> (see the patch site: https://issues.apache.org/jira/browse/SOLR-236)
>
> > I need only to delete duplicates
>
> Could you give us an example what you exactly need?
> (Maybe you could index each master document of the 'unique' documents
> with an extra field and query for that field?)
>
> Regards,
> Peter.
>
> --
Pavel Minchenkov


Re: Solr on iPad?

2010-07-23 Thread Chantal Ackermann
Hi,

unfortunately for iPad developers, it seems that it is not possible to
use the Spotlight engine through the SDK:

http://stackoverflow.com/questions/3133678/spotlight-search-in-the-application

Chantal

On Fri, 2010-07-23 at 10:16 +0200, Mark Allan wrote:
> Hi Stephan,
> 
> On the iPad, as with the iPhone, I'm afraid you're stuck with using  
> SQLite if you want any form of database in your app.
> 
> I suppose if you wanted to get really ambitious and had a lot of time  
> on your hands you could use Xcode to try and compile one of the open- 
> source C-based DBs/Indexers, but as with most things in OS X and iOS  
> development, if you're bending over yourself trying to implement  
> something, you're probably doing it wrongly!  Also, I wouldn't put it  
> past the AppStore guardians to reject your app purely on the basis of  
> having used something other than SQLite!
> 
> Apple's cocoa-dev mailing list is very active if you have problems,  
> but do your homework before asking questions or you'll get short shrift.
>   http://lists.apple.com/cocoa-dev
> 
> Mark
> 
> On 22 Jul 2010, at 6:12 pm, Stephan Schwab wrote:
> 
> > Dear Solr community,
> >
> > does anyone know whether it may be possible or has already been done  
> > to
> > bring Solr to the Apple iPad so that applications may use a local  
> > search
> > engine?
> >
> > Greetings,
> > Stephan
> 





Problem with Pdf, Sol 1.4.1 Cell

2010-07-23 Thread Alessandro Benedetti
Hi all,
as I saw in this discussion [1] there were many issues with PDF indexing in
Solr 1.4 due to the Tika library (version 0.4).
In Solr 1.4.1 the Tika library is the same, so I guess the issues are the
same.
Could anyone, who contributed to the previous thread, help me in resolving
these issues?
I need a simple tutorial that could help me to upgrade Solr Cell!

Something like this:
1) download Tika core from trunk
2) create a jar with Maven dependencies
3) unjar Solr 1.4.1 and swap in the new Tika library
4) re-jar the patched Solr 1.4.1 and enjoy!

[1]
http://markmail.org/message/zbkplnzqho7mxwy3#query:+page:1+mid:gamcxdx34ayt6ccg+state:results

Best regards

-- 
--

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Replacing text fields with numeric fields for speed

2010-07-23 Thread Gora Mohanty
Hi,

  One of the things that we were thinking of doing in order to
speed up results from Solr search is to convert fixed-text fields
(such as values from a drop-down) into numeric fields. The thinking
behind this was that searching through numeric values would be
faster than searching through text. However, I now feel that we
were barking up the wrong tree, as Lucene is probably not doing a
text search per se.

  From some experiments, I see only a small difference between a
text search on a field, and a numeric search on the corresponding
numeric field. This difference can probably be attributed to the
additional processing on the text field. Could someone clarify on
whether one can expect a difference in speed between searching
through a fixed-text field, and its numeric equivalent?

  I am aware of the benefit of numeric fields for range queries.

Regards,
Gora
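[Editor's note: for what it's worth, the conversion Gora describes is trivial to test: assign each fixed drop-down value a stable integer code at index time and search on the code. The query side must apply the same mapping. A sketch; the category values are invented:]

```python
def build_codes(values):
    """Stable value -> int code mapping for a fixed vocabulary.
    Sorting first keeps the codes independent of input order."""
    return {v: i for i, v in enumerate(sorted(values))}

codes = build_codes(["books", "music", "video"])
print(codes["music"])  # 1

# At index time you would write codes[value] into an int field;
# at query time, search on codes[user_selection] instead of the raw text.
```

Since the vocabulary is fixed (drop-down values), the mapping can be built once and shared between the indexer and the query front end.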


Re: Duplicates

2010-07-23 Thread Peter Karich
Hi Pavel!

The patch can be applied to 1.4.
The performance is ok, but for some situations it could be worse than
without the patch.
For us it works good, but others reported some exceptions
(see the patch site: https://issues.apache.org/jira/browse/SOLR-236)

> I need only to delete duplicates

Could you give us an example what you exactly need?
(Maybe you could index each master document of the 'unique' documents
with an extra field and query for that field?)

Regards,
Peter.

> Thanks.
>
> Does it work with Solr 1.4 (Solr 4.0 mentioned in article)?
> What about performance? I need only to delete duplicates (I don't need the
> count of duplicates or to select a particular duplicate).
>
> 2010/7/23 Peter Karich 
>
>   
>> Another possibility could be the well known 'field collapse' ;-)
>>
>> http://wiki.apache.org/solr/FieldCollapsing
>>
>> Regards,
>> Peter.
>>
>> 
>>> Thanks.
>>>
>>> If I set uniqueKey on the field, can I still store duplicates?
>>> I need to remove duplicates only from search results. The ability to store
>>> duplicates should remain.
>>>
>>> 2010/7/23 Erick Erickson 
>>>
>>>
>>>   
 If the field is a single token, just define the uniqueKey on it in your
 schema.

 Otherwise, this may be of interest:
 http://wiki.apache.org/solr/Deduplication

 Haven't used it myself though...

 best
 Erick

 On Thu, Jul 22, 2010 at 6:14 PM, Pavel Minchenkov 
 wrote:


 
> Hi,
>
> Is it possible to remove duplicates in search results by a given field?
>
> Thanks.
>
> --
> Pavel Minchenkov
>   
>>
>> 
>
>   


-- 
http://karussell.wordpress.com/



Re: Solr on iPad?

2010-07-23 Thread Mark Allan

Hi Stephan,

On the iPad, as with the iPhone, I'm afraid you're stuck with using  
SQLite if you want any form of database in your app.


I suppose if you wanted to get really ambitious and had a lot of time  
on your hands you could use Xcode to try and compile one of the open- 
source C-based DBs/Indexers, but as with most things in OS X and iOS  
development, if you're bending over yourself trying to implement  
something, you're probably doing it wrongly!  Also, I wouldn't put it  
past the AppStore guardians to reject your app purely on the basis of  
having used something other than SQLite!


Apple's cocoa-dev mailing list is very active if you have problems,  
but do your homework before asking questions or you'll get short shrift.

http://lists.apple.com/cocoa-dev

Mark

On 22 Jul 2010, at 6:12 pm, Stephan Schwab wrote:


Dear Solr community,

does anyone know whether it may be possible or has already been done  
to
bring Solr to the Apple iPad so that applications may use a local  
search

engine?

Greetings,
Stephan



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
Thanks.

Does it work with Solr 1.4 (Solr 4.0 mentioned in article)?
What about performance? I need only to delete duplicates (I don't need the
count of duplicates or to select a particular duplicate).

2010/7/23 Peter Karich 

> Another possibility could be the well known 'field collapse' ;-)
>
> http://wiki.apache.org/solr/FieldCollapsing
>
> Regards,
> Peter.
>
> > Thanks.
> >
> > If I set uniqueKey on the field, can I still keep duplicates?
> > I need to remove duplicates only from search results. The ability to
> > keep duplicates should remain.
> >
> > 2010/7/23 Erick Erickson 
> >
> >
> >> If the field is a single token, just define the uniqueKey on it in your
> >> schema.
> >>
> >> Otherwise, this may be of interest:
> >> http://wiki.apache.org/solr/Deduplication
> >>
> >> Haven't used it myself though...
> >>
> >> best
> >> Erick
> >>
> >> On Thu, Jul 22, 2010 at 6:14 PM, Pavel Minchenkov 
> >> wrote:
> >>
> >>
> >>> Hi,
> >>>
> >>> Is it possible to remove duplicates in search results by a given field?
> >>>
> >>> Thanks.
> >>>
> >>> --
> >>> Pavel Minchenkov
>
>


-- 
Pavel Minchenkov
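If neither the Deduplication update processor nor a field-collapsing patch is workable on 1.4, one stopgap is to over-fetch rows and drop duplicates by the field on the client side. A minimal sketch (the helper name and document shape are made up for illustration):

```python
def dedupe_by_field(docs, field):
    """Keep only the first (best-ranked) document per field value.

    `docs` is the ordered result list from Solr; documents missing the
    field all share the key None and so collapse together.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = doc.get(field)
        if key in seen:
            continue  # a higher-ranked doc with this value was already kept
        seen.add(key)
        unique.append(doc)
    return unique

results = [
    {"id": 1, "title": "a"},
    {"id": 2, "title": "a"},  # duplicate title: hidden from display only
    {"id": 3, "title": "b"},
]
print([d["id"] for d in dedupe_by_field(results, "title")])
```

The drawback is that facet counts and numFound still include the duplicates, and you must request more rows than you display to fill the page.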


Re: Duplicates

2010-07-23 Thread Peter Karich
Another possibility could be the well known 'field collapse' ;-)

http://wiki.apache.org/solr/FieldCollapsing

Regards,
Peter.

> Thanks.
>
> If I set uniqueKey on the field, can I still keep duplicates?
> I need to remove duplicates only from search results. The ability to
> keep duplicates should remain.
>
> 2010/7/23 Erick Erickson 
>
>   
>> If the field is a single token, just define the uniqueKey on it in your
>> schema.
>>
>> Otherwise, this may be of interest:
>> http://wiki.apache.org/solr/Deduplication
>>
>> Haven't used it myself though...
>>
>> best
>> Erick
>>
>> On Thu, Jul 22, 2010 at 6:14 PM, Pavel Minchenkov 
>> wrote:
>>
>> 
>>> Hi,
>>>
>>> Is it possible to remove duplicates in search results by a given field?
>>>
>>> Thanks.
>>>
>>> --
>>> Pavel Minchenkov



Re: Tree Faceting in Solr 1.4

2010-07-23 Thread Eric Grobler
Thanks, I saw the article.

As far as I can tell the trunk archives only go back to the middle of March
and the 2 patches are from the beginning of the year.

Thus:

"These approaches can be tried out easily using a single set of sample data
and the Solr example application (assumes current trunk codebase and latest
patches posted to the respective issues)."

is a bit of an over-statement!

Regards
Eric
On Fri, Jul 23, 2010 at 6:22 AM, Jonathan Rochkind  wrote:

> Solr does not, yet, at least not simply, as far as I know, but there are
> ideas and some JIRA's with maybe some patches:
>
> http://wiki.apache.org/solr/HierarchicalFaceting
>
>
> 
> From: rajini maski [rajinima...@gmail.com]
> Sent: Friday, July 23, 2010 12:34 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tree Faceting in Solr 1.4
>
> I am also looking for the same feature in Solr and am very keen to know
> whether it supports tree faceting, or whether we are forced to index in a
> tree faceting format like
>
> 1/2/3/4
> 1/2/3
> 1/2
> 1
>
> In case of multilevel faceting, I found it gives only a 2-level tree
> facet.
>
> If I give a query such as country:India AND state:Karnataka AND
> city:Bangalore, all I want is a facet count 1) for the condition above,
> 2) the number of states in that country, and 3) the number of cities in
> that state...
>
> Like => Country: India ,State:Karnataka , City: Bangalore <1>
>
> State:Karnataka
>  Kerla
>  Tamilnadu
>  Andra Pradesh...and so on
>
> City:  Mysore
>  Hubli
>  Mangalore
>  Coorg and so on...
>
>
> If I am doing
> facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka
>
> All it gives me is facets on state, excluding only that filter query, but
> I was not able to do the same at the third level, e.g. a facet.field that
> also gives me the counts of cities in the state Karnataka.
> Please let me know a solution for this...
>
> Regards,
> Rajani Maski
>
>
>
>
>
> On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler wrote:
>
> > Thank you for the link.
> >
> > I was not aware of the multifaceting syntax - this will enable me to run
> 1
> > less query on the main page!
> >
> > However this is not a tree faceting feature.
> >
> > Thanks
> > Eric
> >
> >
> >
> >
> > On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:
> >
> > > Perhaps the following article can help:
> > >
> >
> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
> > >
> > > -S
> > >
> > >
> > > On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
> > >
> > > > Hi Solr Community
> > > >
> > > > If I have:
> > > > COUNTRY CITY
> > > > Germany Berlin
> > > > Germany Hamburg
> > > > Spain   Madrid
> > > >
> > > > Can I do faceting like:
> > > > Germany
> > > >  Berlin
> > > >  Hamburg
> > > > Spain
> > > >  Madrid
> > > >
> > > > I tried to apply SOLR-792 to the current trunk but it does not seem
> to
> > be
> > > > compatible.
> > > > Maybe there is a similar feature existing in the latest builds?
> > > >
> > > > Thanks & Regards
> > > > Eric
> > >
> > >
> >
>
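The path-prefix indexing scheme rajini describes (1/2/3/4, 1/2/3, 1/2, 1) can be generated mechanically at index time. A minimal sketch (the helper name is made up; later Solr releases ship a PathHierarchyTokenizer that performs this expansion during analysis):

```python
def hierarchy_tokens(path, sep="/"):
    """Expand 'India/Karnataka/Bangalore' into all of its ancestor
    prefixes, so that faceting on any prefix counts every document
    indexed below it."""
    parts = path.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]

print(hierarchy_tokens("India/Karnataka/Bangalore"))
# ['India', 'India/Karnataka', 'India/Karnataka/Bangalore']
```

Index the returned tokens into a multivalued field; a facet query filtered on `India/Karnataka` then yields the city-level counts directly, which covers the third level that plain multi-select faceting misses.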


Re: Getting FileNotFoundException with repl command=backup?

2010-07-23 Thread Alexander Rothenberg
Thanks for the info Peter, I think I ran into the same issue some time ago
and could not find out why the backup stopped and also got deleted by Solr.

I decided to stop currently running updates to Solr while the backup is
running, and wrote my own backup handler that simply copies the index files
to some location and rotates older, unneeded backups.

I thought about a cleaner solution where the backup handler would create a
LOCK on the index to prevent incoming updates from writing into it (the
same happens while an index optimize is running). Then, while the LOCK is
set, a backup could run without any problems and would remove the LOCK when
done. But I was never able to create a working LOCK that prevents incoming
updates from being applied...
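Alexander's copy-and-rotate backup handler could look roughly like this. A sketch only: the snapshot naming and the `keep` policy are made up, and it assumes (as he describes) that no writer touches the index while the copy runs:

```python
import shutil
import time
from pathlib import Path

def backup_index(index_dir, backup_root, keep=3):
    """Copy the index directory to a timestamped snapshot directory and
    delete the oldest snapshots beyond `keep`.

    Timestamped names sort lexicographically in chronological order, so a
    plain sort identifies the oldest snapshots to rotate away.
    """
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    snapshot = backup_root / time.strftime("snapshot.%Y%m%d%H%M%S")
    shutil.copytree(index_dir, snapshot)  # destination must not yet exist
    snapshots = sorted(p for p in backup_root.iterdir() if p.is_dir())
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return snapshot
```

Note that without the LOCK he describes, a commit or merge during the copy can still leave the snapshot inconsistent, which is why pausing updates first matters.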

-- 
Alexander Rothenberg
Fotofinder GmbH USt-IdNr. DE812854514
Software EntwicklungWeb: http://www.fotofinder.net/
Potsdamer Str. 96   Tel: +49 30 25792890
10785 BerlinFax: +49 30 257928999

Geschäftsführer:Ali Paczensky
Amtsgericht:Berlin Charlottenburg (HRB 73099)
Sitz:   Berlin


Re: Duplicates

2010-07-23 Thread Pavel Minchenkov
Thanks.

If I set uniqueKey on the field, can I still keep duplicates?
I need to remove duplicates only from search results. The ability to keep
duplicates should remain.

2010/7/23 Erick Erickson 

> If the field is a single token, just define the uniqueKey on it in your
> schema.
>
> Otherwise, this may be of interest:
> http://wiki.apache.org/solr/Deduplication
>
> Haven't used it myself though...
>
> best
> Erick
>
> On Thu, Jul 22, 2010 at 6:14 PM, Pavel Minchenkov 
> wrote:
>
> > Hi,
> >
> > Is it possible to remove duplicates in search results by a given field?
> >
> > Thanks.
> >
> > --
> > Pavel Minchenkov
> >
>



-- 
Pavel Minchenkov