You could also try the new[ish] PostingsHighlighter: http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
Mike McCandless

http://blog.mikemccandless.com

On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:
> If you have very large documents (many MB) that can lead to slow
> highlighting, even with FVH.
>
> See https://issues.apache.org/jira/browse/LUCENE-3234
>
> and try setting phraseLimit=1 (or some bigger number, but not infinite,
> which is the default)
>
> -Mike
>
> On 6/14/13 4:52 PM, Andy Brown wrote:
>>
>> Bryan,
>>
>> For specifics, I'll refer you back to my original email where I
>> specified all the fields/field types/handlers I use. Here's a general
>> overview.
>>
>> I really only have 3 fields that I index and search against: "name",
>> "description", and "content". All of which are just general text
>> (string) fields. I have a catch-all field called "text" that is only
>> used for querying. It's indexed but not stored. The "name",
>> "description", and "content" fields are copied into the "text" field.
>>
>> For partial word matching, I have 4 more fields: "name_par",
>> "description_par", "content_par", and "text_par". The "text_par" field
>> has the same relationship to the "*_par" fields as "text" does to the
>> others (only used for querying). Those partial word matching fields are
>> of type "text_general_partial" which I created. That field type is
>> analyzed differently than the regular text field in that it goes through
>> an EdgeNGramFilterFactory with minGramSize="2" and maxGramSize="7"
>> at index time.
>>
>> I query against both "text" and "text_par" fields using edismax deftype
>> with my qf set to "text^2 text_par^1" to give full word matches a higher
>> score. This part returns very fast as previously stated. It's when
>> I turn on highlighting that I take the huge performance hit.
>>
>> Again, I'm using the FastVectorHighlighter. The hl.fl is set to "name
>> name_par description description_par content content_par" so that it
>> returns highlights for full and partial word matches.
>> All of those fields have indexed, stored, termPositions, termVectors, and
>> termOffsets set to "true".
>>
>> It all seems redundant just to allow for partial word
>> matching/highlighting but I didn't know of a better way. Does anything
>> stand out to you that could be the culprit? Let me know if you need any
>> more clarification.
>>
>> Thanks!
>> - Andy
>>
>> -----Original Message-----
>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>> Sent: Wednesday, May 29, 2013 5:44 PM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>>
>> Andy,
>>
>>> I don't understand why it's taking 7 secs to return highlights. The size
>>> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
>>> 1024 for this verification purpose and that should be more than enough.
>>> The processor is plenty powerful enough as well.
>>>
>>> Running VisualVM shows all my CPU time being taken by mainly these 3
>>> methods:
>>>
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseInfo.getStartOffset()
>>> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap()
>>
>> That is a strange and interesting set of things to be spending most of
>> your CPU time on. The implication, I think, is that the number of term
>> matches in the document for terms in your query (or, at least, terms
>> matching exact words or the beginning of phrases in your query) is
>> extremely high. Perhaps that's coming from this "partial word match" you
>> mention -- how does that work?
>>
>> -- Bryan
>>
>>> My guess is that this has something to do with how I'm handling partial
>>> word matches/highlighting.
>>> I have set up another request handler that only searches the whole word
>>> fields and it returns in 850 ms with highlighting.
>>>
>>> Any ideas?
>>>
>>> - Andy
>>>
>>> -----Original Message-----
>>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
>>> Sent: Monday, May 20, 2013 1:39 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Slow Highlighter Performance Even Using FastVectorHighlighter
>>>
>>> My guess is that the problem is those 200 MB documents.
>>> FastVectorHighlighter is fast at deciding whether a match, especially a
>>> phrase, appears in a document, but it still starts out by walking the
>>> entire list of term vectors, and ends by breaking the document into
>>> candidate-snippet fragments, both processes that are proportional to the
>>> length of the document.
>>>
>>> It's hard to do much about the first, but for the second you could choose
>>> to expose FastVectorHighlighter's FieldPhraseList representation, and
>>> return offsets to the caller rather than fragments, building up your own
>>> snippets from a separate store of indexed files. This would also permit
>>> you to set stored="false", improving your memory/core size ratio, which
>>> I'm guessing could use some improving. It would require some work, and it
>>> would require you to store a representation of what was indexed outside
>>> the Solr core, in some constant-bytes-to-character representation that you
>>> can use offsets with (e.g. UTF-16, or ASCII+entity references).
>>>
>>> However, you may not need to do this -- it may be that you just need more
>>> memory for your search machine. Not JVM memory, but memory that the O/S
>>> can use as a file cache. What do you have now? That is, how much memory do
>>> you have that is not used by the JVM or other apps, and how big is your
>>> Solr core?
>>>
>>> One way to start getting a handle on where time is being spent is to set
>>> up VisualVM. Turn on CPU sampling, send in a bunch of the slow highlight
>>> queries, and look at where the time is being spent. If it's mostly in
>>> methods that are just reading from disk, buy more memory. If you're on
>>> Linux, look at what top is telling you. If the CPU usage is low and the
>>> "wa" number is above 1% more often than not, buy more memory (I don't know
>>> why that wa number makes sense, I just know that it has been a good rule
>>> of thumb for us).
>>>
>>> -- Bryan
>>>
>>>> -----Original Message-----
>>>> From: Andy Brown [mailto:andy_br...@rhoworld.com]
>>>> Sent: Monday, May 20, 2013 9:53 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Slow Highlighter Performance Even Using FastVectorHighlighter
>>>>
>>>> I'm providing a search feature in a web app that searches for documents
>>>> that range in size from 1KB to 200MB of varying MIME types (PDF, DOC,
>>>> etc). Currently there are about 3000 documents and this will continue to
>>>> grow. I'm providing full word search and partial word search. For each
>>>> document, there are three source fields that I'm interested in searching
>>>> and highlighting on: name, description, and content. Since I'm providing
>>>> both full and partial word search, I've created additional fields that
>>>> get tokenized differently: name_par, description_par, and content_par.
>>>> Those are indexed and stored as well for querying and highlighting. As
>>>> suggested in the Solr wiki, I've got two catch-all fields, text and
>>>> text_par, for faster querying.
>>>>
>>>> An average search results page displays 25 results and I provide paging.
>>>> I'm just returning the doc ID in my Solr search results and response
>>>> times have been quite good (1 to 10 ms). The problem in performance
>>>> occurs when I turn on highlighting. I'm already using the
>>>> FastVectorHighlighter and depending on the query, it has taken as long
>>>> as 15 seconds to get the highlight snippets. However, this isn't always
>>>> the case. Certain query terms result in 1 sec or less response time. In
>>>> any case, 15 seconds is way too long.
>>>>
>>>> I'm fairly new to Solr but I've spent days coming up with what I've got
>>>> so far. Feel free to correct any misconceptions I have. Can anyone
>>>> advise me on what I'm doing wrong or offer a better way to set up my core
>>>> to improve highlighting performance?
>>>>
>>>> A typical query would look like:
>>>> /select?q=foo&start=0&rows=25&fl=id&hl=true
>>>>
>>>> I'm using Solr 4.1. Below are the relevant core schema and config details:
>>>>
>>>> <!-- Misc fields -->
>>>> <field name="_version_" type="long" indexed="true" stored="true"/>
>>>> <field name="id" type="string" indexed="true" stored="true"
>>>>   required="true" multiValued="false"/>
>>>>
>>>> <!-- Fields for whole word matches -->
>>>> <field name="name" type="text_general" indexed="true" stored="true"
>>>>   multiValued="true" termPositions="true" termVectors="true"
>>>>   termOffsets="true"/>
>>>> <field name="description" type="text_general" indexed="true"
>>>>   stored="true" multiValued="true" termPositions="true"
>>>>   termVectors="true" termOffsets="true"/>
>>>> <field name="content" type="text_general" indexed="true" stored="true"
>>>>   multiValued="true" termPositions="true" termVectors="true"
>>>>   termOffsets="true"/>
>>>> <field name="text" type="text_general" indexed="true" stored="false"
>>>>   multiValued="true"/>
>>>>
>>>> <!-- Fields for partial word matches -->
>>>> <field name="name_par" type="text_general_partial"
>>>>   indexed="true" stored="true" multiValued="true" termPositions="true"
>>>>   termVectors="true" termOffsets="true"/>
>>>> <field name="description_par" type="text_general_partial" indexed="true"
>>>>   stored="true" multiValued="true" termPositions="true"
>>>>   termVectors="true" termOffsets="true"/>
>>>> <field name="content_par" type="text_general_partial" indexed="true"
>>>>   stored="true" multiValued="true" termPositions="true"
>>>>   termVectors="true" termOffsets="true"/>
>>>> <field name="text_par" type="text_general_partial" indexed="true"
>>>>   stored="false" multiValued="true"/>
>>>>
>>>> <!-- Copy source name, description, and content fields to name_par,
>>>>   description_par, and content_par for partial word searches -->
>>>> <copyField source="name" dest="name_par"/>
>>>> <copyField source="description" dest="description_par"/>
>>>> <copyField source="content" dest="content_par"/>
>>>>
>>>> <!-- Copy source name, description, and content fields to catch-all text
>>>>   field for faster querying. -->
>>>> <copyField source="name" dest="text"/>
>>>> <copyField source="description" dest="text"/>
>>>> <copyField source="content" dest="text"/>
>>>>
>>>> <!-- Copy source name, description, and content fields to catch-all
>>>>   text_par field for faster querying of partial word searches.
>>>>   -->
>>>> <copyField source="name" dest="text_par"/>
>>>> <copyField source="description" dest="text_par"/>
>>>> <copyField source="content" dest="text_par"/>
>>>>
>>>> <!-- A text field for whole word matches -->
>>>> <fieldType name="text_general" class="solr.TextField"
>>>>   positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>       words="stopwords.txt" enablePositionIncrements="true" />
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>       words="stopwords.txt" enablePositionIncrements="true" />
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>       ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> <!-- A text field for partial matches -->
>>>> <fieldType name="text_general_partial" class="solr.TextField"
>>>>   positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>       words="stopwords.txt" enablePositionIncrements="true" />
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
>>>>       maxGramSize="7"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>       words="stopwords.txt" enablePositionIncrements="true" />
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>       ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> <requestHandler name="/select"
>>>>   class="solr.SearchHandler">
>>>>   <!-- default values for query parameters can be specified, these
>>>>     will be overridden by parameters in the request. -->
>>>>   <lst name="defaults">
>>>>     <str name="echoParams">explicit</str>
>>>>     <int name="rows">10</int>
>>>>     <str name="df">text</str>
>>>>     <str name="defType">edismax</str>
>>>>     <str name="qf">text^2 text_par^1</str> <!-- Boost whole
>>>>       word matches more than partial matches in the scoring. -->
>>>>     <bool name="termVectors">true</bool>
>>>>     <bool name="termPositions">true</bool>
>>>>     <bool name="termOffsets">true</bool>
>>>>     <bool name="hl.useFastVectorHighlighter">true</bool>
>>>>     <str name="hl.boundaryScanner">breakIterator</str>
>>>>     <str name="hl.snippets">2</str>
>>>>     <str name="hl.fl">name name_par description description_par
>>>>       content content_par</str>
>>>>     <int name="hl.fragsize">162</int>
>>>>     <str name="hl.fragListBuilder">simple</str>
>>>>     <str name="hl.fragmentsBuilder">default</str>
>>>>     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
>>>>     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
>>>>     <str name="hl.tag.pre"><![CDATA[<strong>]]></str>
>>>>     <str name="hl.tag.post"><![CDATA[</strong>]]></str>
>>>>   </lst>
>>>> </requestHandler>
>>>>
>>>> Cheers!
>>>>
>>>> - Andy
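
For reference, a minimal sketch of the phraseLimit suggestion made earlier in the thread, written as a default on the /select handler. It assumes the Solr release in use exposes the limit as the hl.phraseLimit request parameter; check the highlighting documentation for your version before relying on it. Only the lines relevant to the limit are shown, and the value 1 is simply the starting point suggested in the thread, not a tuned setting.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="hl.useFastVectorHighlighter">true</bool>
    <!-- Assumption: hl.phraseLimit is supported by this Solr release.
         It caps how many phrases FastVectorHighlighter analyzes per field;
         raise it if too few snippets come back. -->
    <int name="hl.phraseLimit">1</int>
  </lst>
</requestHandler>
```

Under the same assumption, the limit can be tried per request first, without touching solrconfig.xml: /select?q=foo&start=0&rows=25&fl=id&hl=true&hl.phraseLimit=1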