You could also try the new[ish] PostingsHighlighter:
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
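
For Solr, a minimal sketch of switching highlighters might look like the following. It assumes your Solr version ships org.apache.solr.highlight.PostingsSolrHighlighter and that the "content" field from the schema quoted below is the one being highlighted. The field has to be reindexed with offsets stored in the postings, and this highlighter does not need term vectors. This is untested against the edge n-gram fields.

```xml
<!-- solrconfig.xml: use the postings-based highlighter
     (assumes it is available in your Solr version) -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

<!-- schema.xml: store offsets in the postings for each highlighted field
     (reindex required) -->
<field name="content" type="text_general" indexed="true" stored="true"
       multiValued="true" storeOffsetsWithPositions="true"/>
```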

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
> If you have very large documents (many MB), that can lead to slow
> highlighting, even with FVH.
>
> See https://issues.apache.org/jira/browse/LUCENE-3234
>
> and try setting phraseLimit=1 (or some bigger number, but not infinite,
> which is the default)
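
In Solr this setting is exposed as the hl.phraseLimit request parameter. A sketch, assuming the /select handler quoted below; the value 1 is only a starting point, and the same thing can be passed per request as &hl.phraseLimit=1:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="hl.useFastVectorHighlighter">true</bool>
    <!-- cap the number of phrases FVH analyzes per field -->
    <int name="hl.phraseLimit">1</int>
  </lst>
</requestHandler>
```
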
>
> -Mike
>
>
>
> On 6/14/13 4:52 PM, Andy Brown wrote:
>>
>> Bryan,
>>
>> For specifics, I'll refer you back to my original email where I
>> specified all the fields/field types/handlers I use. Here's a general
>> overview.
>>   I really only have 3 fields that I index and search against: "name",
>> "description", and "content". All of which are just general text
>> (string) fields. I have a catch-all field called "text" that is only
>> used for querying. It's indexed but not stored. The "name",
>> "description", and "content" fields are copied into the "text" field.
>>   For partial word matching, I have 4 more fields: "name_par",
>> "description_par", "content_par", and "text_par". The "text_par" field
>> has the same relationship to the "*_par" fields as "text" does to the
>> others (only used for querying). Those partial word matching fields are
>> of type "text_general_partial" which I created. That field type is
>> analyzed differently from the regular text field in that it goes through
>> an EdgeNGramFilterFactory with minGramSize="2" and maxGramSize="7"
>> at index time.
>>   I query against both "text" and "text_par" fields using the edismax defType
>> with my qf set to "text^2 text_par^1" to give full word matches a higher
>> score. This part returns very fast, as previously stated. It's when
>> I turn on highlighting that I take the huge performance hit.
>>   Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
>> name_par description description_par content content_par" so that it
>> returns highlights for full and partial word matches. All of those
>> fields have indexed, stored, termPositions, termVectors, and termOffsets
>> set to "true".
>>   It all seems redundant just to allow for partial word
>> matching/highlighting but I didn't know of a better way. Does anything
>> stand out to you that could be the culprit? Let me know if you need any
>> more clarification.
>>   Thanks!
>>   - Andy
>>
>> -----Original Message-----
>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>> Sent: Wednesday, May 29, 2013 5:44 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Slow Highlighter Performance Even Using
>> FastVectorHighlighter
>>
>> Andy,
>>
>>> I don't understand why it's taking 7 secs to return highlights. The size
>>> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
>>> 1024 for this verification purpose and that should be more than enough.
>>> The processor is plenty powerful enough as well.
>>>
>>> Running VisualVM shows all my CPU time being taken by mainly these 3
>>> methods:
>>>
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>>
>> That is a strange and interesting set of things to be spending most of
>> your CPU time on. The implication, I think, is that the number of term
>> matches in the document for terms in your query (or, at least, terms
>> matching exact words or the beginning of phrases in your query) is
>> extremely high. Perhaps that's coming from this "partial word match"
>> you mention -- how does that work?
>>
>> -- Bryan
>>
>>> My guess is that this has something to do with how I'm handling partial
>>> word matches/highlighting. I have set up another request handler that
>>> only searches the whole word fields and it returns in 850 ms with
>>> highlighting.
>>>
>>> Any ideas?
>>>
>>> - Andy
>>>
>>>
>>> -----Original Message-----
>>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>>> Sent: Monday, May 20, 2013 1:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Slow Highlighter Performance Even Using
>>> FastVectorHighlighter
>>>
>>> My guess is that the problem is those 200M documents.
>>> FastVectorHighlighter is fast at deciding whether a match, especially a
>>> phrase, appears in a document, but it still starts out by walking the
>>> entire list of term vectors, and ends by breaking the document into
>>> candidate-snippet fragments, both processes that are proportional to the
>>> length of the document.
>>>
>>> It's hard to do much about the first, but for the second you could choose
>>> to expose FastVectorHighlighter's FieldPhraseList representation, and
>>> return offsets to the caller rather than fragments, building up your own
>>> snippets from a separate store of indexed files. This would also permit
>>> you to set stored="false", improving your memory/core size ratio, which
>>> I'm guessing could use some improving. It would require some work, and it
>>> would require you to store a representation of what was indexed outside
>>> the Solr core, in some constant-bytes-to-character representation that you
>>> can use offsets with (e.g. UTF-16, or ASCII+entity references).
>>>
>>> However, you may not need to do this -- it may be that you just need more
>>> memory for your search machine. Not JVM memory, but memory that the O/S
>>> can use as a file cache. What do you have now? That is, how much memory do
>>> you have that is not used by the JVM or other apps, and how big is your
>>> Solr core?
>>>
>>> One way to start getting a handle on where time is being spent is to set
>>> up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
>>> queries, and look at where the time is being spent. If it's mostly in
>>> methods that are just reading from disk, buy more memory. If you're on
>>> Linux, look at what top is telling you. If the CPU usage is low and the
>>> "wa" number is above 1% more often than not, buy more memory (I don't know
>>> why that wa number makes sense, I just know that it has been a good rule
>>> of thumb for us).
>>>
>>> -- Bryan
>>>
>>>> -----Original Message-----
>>>> From: Andy Brown [mailto:andy_br...@rhoworld.com]
>>>> Sent: Monday, May 20, 2013 9:53 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
>>>>
>>>> I'm providing a search feature in a web app that searches for documents
>>>> that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
>>>> etc). Currently there are about 3000 documents and this will continue to
>>>> grow. I'm providing full word search and partial word search. For each
>>>> document, there are three source fields that I'm interested in searching
>>>> and highlighting on: name, description, and content. Since I'm providing
>>>> both full and partial word search, I've created additional fields that
>>>> get tokenized differently: name_par, description_par, and content_par.
>>>> Those are indexed and stored as well for querying and highlighting. As
>>>> suggested in the Solr wiki, I've got two catch all fields text and
>>>> text_par for faster querying.
>>>>
>>>> An average search results page displays 25 results and I provide paging.
>>>> I'm just returning the doc ID in my Solr search results and response
>>>> times have been quite good (1 to 10 ms). The problem in performance
>>>> occurs when I turn on highlighting. I'm already using the
>>>> FastVectorHighlighter and depending on the query, it has taken as long
>>>> as 15 seconds to get the highlight snippets. However, this isn't always
>>>> the case. Certain query terms result in 1 sec or less response time. In
>>>> any case, 15 seconds is way too long.
>>>>
>>>> I'm fairly new to Solr but I've spent days coming up with what I've got
>>>> so far. Feel free to correct any misconceptions I have. Can anyone
>>>> advise me on what I'm doing wrong or offer a better way to set up my core
>>>> to improve highlighting performance?
>>>>
>>>> A typical query would look like:
>>>> /select?q=foo&start=0&rows=25&fl=id&hl=true
>>>>
>>>> I'm using Solr 4.1. Below are the relevant core schema and config details:
>>>>
>>>> <!-- Misc fields -->
>>>> <field name="_version_" type="long" indexed="true" stored="true"/>
>>>> <field name="id" type="string" indexed="true" stored="true"
>>>> required="true" multiValued="false"/>
>>>>
>>>>
>>>> <!-- Fields for whole word matches -->
>>>> <field name="name" type="text_general" indexed="true" stored="true"
>>>> multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="description" type="text_general" indexed="true"
>>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="content" type="text_general" indexed="true" stored="true"
>>>> multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="text" type="text_general" indexed="true" stored="false"
>>>> multiValued="true"/>
>>>>
>>>> <!-- Fields for partial word matches -->
>>>> <field name="name_par" type="text_general_partial" indexed="true"
>>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="description_par" type="text_general_partial" indexed="true"
>>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="content_par" type="text_general_partial" indexed="true"
>>>> stored="true" multiValued="true" termPositions="true" termVectors="true"
>>>> termOffsets="true"/>
>>>> <field name="text_par" type="text_general_partial" indexed="true"
>>>> stored="false" multiValued="true"/>
>>>>
>>>>
>>>> <!-- Copy source name, description, and content fields to name_par,
>>>> description_par, and content_par for partial word searches -->
>>>> <copyField source="name" dest="name_par"/>
>>>> <copyField source="description" dest="description_par"/>
>>>> <copyField source="content" dest="content_par"/>
>>>>
>>>> <!-- Copy source name, description, and content fields to catch-all text
>>>> field for faster querying. -->
>>>> <copyField source="name" dest="text"/>
>>>> <copyField source="description" dest="text"/>
>>>> <copyField source="content" dest="text"/>
>>>>
>>>> <!-- Copy source name, description, and content fields to catch-all
>>>> text_par field for faster querying of partial word searches. -->
>>>> <copyField source="name" dest="text_par"/>
>>>> <copyField source="description" dest="text_par"/>
>>>> <copyField source="content" dest="text_par"/>
>>>>
>>>> <!-- A text field for whole word matches -->
>>>> <fieldType name="text_general" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>    <analyzer type="index">
>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>    </analyzer>
>>>>    <analyzer type="query">
>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>> ignoreCase="true" expand="true"/>
>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>     </analyzer>
>>>>   </fieldType>
>>>>
>>>> <!-- A text field for partial matches -->
>>>> <fieldType name="text_general_partial" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>    <analyzer type="index">
>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>>>> maxGramSize="7"/>
>>>>    </analyzer>
>>>>    <analyzer type="query">
>>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>> words="stopwords.txt" enablePositionIncrements="true" />
>>>>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>> ignoreCase="true" expand="true"/>
>>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>>    </analyzer>
>>>> </fieldType>
>>>>
>>>>
>>>>
>>>> <requestHandler name="/select" class="solr.SearchHandler">
>>>>      <!-- default values for query parameters can be specified, these
>>>> will be overridden by parameters in the request. -->
>>>>       <lst name="defaults">
>>>>         <str name="echoParams">explicit</str>
>>>>         <int name="rows">10</int>
>>>>         <str name="df">text</str>
>>>>         <str name="defType">edismax</str>
>>>>         <str name="qf">text^2 text_par^1</str>   <!-- Boost whole
>>>> word matches more than partial matches in the scoring. -->
>>>>         <bool name="termVectors">true</bool>
>>>>         <bool name="termPositions">true</bool>
>>>>         <bool name="termOffsets">true</bool>
>>>>         <bool name="hl.useFastVectorHighlighter">true</bool>
>>>>         <str name="hl.boundaryScanner">breakIterator</str>
>>>>         <str name="hl.snippets">2</str>
>>>>         <str name="hl.fl">name name_par description description_par
>>>> content content_par</str>
>>>>         <int name="hl.fragsize">162</int>
>>>>         <str name="hl.fragListBuilder">simple</str>
>>>>         <str name="hl.fragmentsBuilder">default</str>
>>>>         <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
>>>>         <str name="hl.simple.post"><![CDATA[</strong>]]></str>
>>>>         <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
>>>>         <str name="hl.tag.post"><![CDATA[</strong>]]></str>
>>>>       </lst>
>>>>   </requestHandler>
>>>>
>>>>
>>>> Cheers!
>>>>
>>>> - Andy
>
>
