Hello Teague,

Thanks for your reply and tip! I think Solr will give me a better result than just using Tika to read my files and send them to a full-text index in MySQL, which has the specific drawback of not highlighting the text snippets...
So, I will keep trying to fit Solr to my needs. I am sure it works... I am just missing something.

Thanks again, and I will keep at it. When I find the solution I will post all files and configs here for future reference.

Best regards,

*Evert*

2015-12-17 6:11 GMT-02:00 Teague James <teag...@insystechinc.com>:

> Erick's comments notwithstanding, there are some gaps in my understanding
> of your precise situation. Here are a few things that weren't necessarily
> obvious to me when I took my first try with Solr.
>
> Highlighting is the end result of a good hit. It is essentially formatting
> applied to your hit. It is possible to get a hit without a highlight if
> certain conditions exist.
>
> First, start by making sure you are indexing your target (a PDF file?)
> correctly. Assuming you are indexing PDFs, are you extracting metadata
> only or are you parsing the document with Tika? If you want hits on the
> contents of your PDF, then you have to parse it at index time and store
> that. That was why I suggested just running some queries through the
> interface and the URL to see what Solr actually captured from your indexed
> PDF before worrying about how it looks on the screen.
>
> Next, you should look carefully at the Analyzer's output. Notice the
> abbreviations to the left of the columns? Hover over those to see what
> filter factory it is. When words are split into multiple columns at one of
> those points, it indicates that the filter factory broke apart the word
> while analyzing it. Do a search for the filter factories that you
> find and read up on them. In my case "1a" was being split into 4 by a word
> delimiter filter factory - "1a", "1", "a", "1a" - which caused highlighting
> to fail in my case while still getting a hit. It also caused erroneous hits
> elsewhere. Adding some switches to the schema is all it took to correct
> that for me. However, every case is different based on your needs.
> That is why it is important to go through the analyzer and see if Solr's
> indexing and querying are doing what you expect.
>
> If that looks good and you've got solid hits all the way down, then it is
> time to start looking at your highlighter implementation in the index and
> query analyzers that you are using. My original issue of not being able to
> highlight phrases with one set of tags necessitated me switching to the
> fast vector highlighter - which had its own requirements for certain
> parameters to be set. Here again - going to the Solr docs and reading up on
> the various highlighters will be helpful in most cases.
>
> Solr has a very steep learning curve. I've been using it for several years
> and I still consider myself a noob. It can be a deep dive, but don't be
> discouraged. Keep at it. Cheers!
>
> -Teague
>
> On Wed, Dec 16, 2015 at 8:54 PM, Evert R. <evert.ra...@gmail.com> wrote:
>
> > Hi Erick and Teague,
> >
> > I found that when using the field 'text' it shows the pdf file result
> > id:pdf1 in this case, like:
> >
> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
> >
> > but when highlighting, using the text field... nothing comes up...
> >
> > http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
> >
> > or even with the option
> >
> > f.text.hl.snippets=2 under the hl.fl field.
> >
> > I tried as well with the standard configuration, did it all over,
> > reindexed a couple of times... and still it did not work.
> >
> > Also,
> >
> > Using the Analysis, it brings the information below:
> >
> > ST
> > text     raw_bytes               start  end  positionLength  type        position
> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> > SF
> > text     raw_bytes               start  end  positionLength  type        position
> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> > LCF
> > text     raw_bytes               start  end  positionLength  type        position
> > nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> >
> > Alphanumeric I think... so, it's 'string', right? Would that be a
> > problem? Should there be some other indication?
> >
> > Thanks again!
> >
> > *Evert*
> >
> > 2015-12-16 21:09 GMT-02:00 Erick Erickson <erickerick...@gmail.com>:
> >
> > > I think you're still missing the critical bit. Highlighting is
> > > completely separate from searching. In other words, you can search on
> > > one field and highlight another. What field is searched is governed by
> > > the "qf" parameter when using edismax and by the "df" parameter
> > > configured in your request handler in solrconfig.xml. These defaults
> > > are overridden when you do a "fielded search" like
> > >
> > > q=content:nietava
> > >
> > > So this: q=content:nietava&hl=true&hl.fl=content
> > > is searching the "content" field. The word you're looking for isn't in
> > > the content field so naturally no docs are returned. And no
> > > highlighting either.
> > >
> > > This: q=nietava&hl=true&hl.fl=content
> > >
> > > is searching somewhere else, thus getting the hit. We already know
> > > that "nietava" is not in the content field because the first search
> > > failed. You need to find out what field is being matched (probably
> > > something like "text") and then try highlighting on _that_ field. Try
> > > adding "debug=query" to the URL and look at the "parsed_query" section
> > > of the return and you'll see what field(s) is/are actually being
> > > searched against.
> > >
> > > NOTE: The field you highlight on _must_ have stored="true" in
> > > schema.xml.
> > >
> > > As to why "nietava" isn't being found in the content field, probably
> > > you have some kind of analysis chain configured for that field that
> > > isn't searching as you expect. See the admin/analysis page for some
> > > insight into why that would be. The most frequent reason is that the
> > > field is a "string" type which is not broken up into words. Another
> > > possibility is that your analysis chain is leaving in the quotes or
> > > something similar. As James says, looking at admin/analysis is a good
> > > way to figure this out.
> > >
> > > I still strongly recommend you go from the stock techproducts example
> > > and get familiar with how Solr (and highlighting) work before jumping
> > > in and changing things. There are a number of ways things can be
> > > mis-configured and trying to change several things at once is a fine
> > > way to go mad. The admin UI>>schema browser is another way you can see
> > > what kind of terms are _actually_ in your index in a particular field.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Dec 16, 2015 at 12:26 PM, Teague James
> > > <teag...@insystechinc.com> wrote:
> > > > Sorry to hear that didn't work! Let me ask a couple of questions...
> > > >
> > > > Have you tried the analyzer inside of the Admin Interface? It has
> > > > helped me sort out a number of highlighting issues in the past. To
> > > > access it, go to your Admin interface, select your core, then select
> > > > Analysis from the list of options on the left. In the analyzer, enter
> > > > the term you are indexing in the top left and right input fields (in
> > > > other words, the term in the document you are indexing that you
> > > > expect to get a hit on). Select the field that it is destined for (in
> > > > your case that would be 'content'), then hit analyze. Helps if you
> > > > have a big screen!
> > > >
> > > > This will show you the impact of the various filter factories that
> > > > you have engaged and their effect on whether or not a 'hit' is being
> > > > generated. Hits are identified by a very faint highlight. (PSST...
> > > > Developers... It would be really cool if the highlight color were
> > > > more visible or customizable... Thanks y'all) If it looks like you're
> > > > getting hits, but not getting highlighting, then open up a new tab
> > > > with the Admin's query interface. Same place on the left as the
> > > > analyzer. Replace the "*:*" with your search term (assuming you
> > > > already indexed your document) and if necessary you can put something
> > > > in the FQ like "id:123456" to target a specific record.
> > > >
> > > > Did you get a hit? If no, then it's not highlighting that's the
> > > > issue. If yes, then try dumping this in your address bar (using your
> > > > URL/IP, search term, and core name of course. The fq= is an example):
> > > >
> > > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
> > > >
> > > > That will dump Solr's output to your browser where you can see
> > > > exactly what is getting hit.
> > > >
> > > > Hope that helps! Let me know how it goes. Good luck.
> > > >
> > > > -Teague
> > > >
> > > > -----Original Message-----
> > > > From: Evert R. [mailto:evert.ra...@gmail.com]
> > > > Sent: Wednesday, December 16, 2015 1:46 PM
> > > > To: solr-user <solr-user@lucene.apache.org>
> > > > Subject: Re: Solr Basic Configuration - Highlight - Begginer
> > > >
> > > > Hi Teague!
> > > >
> > > > I configured solrconfig.xml and schema.xml exactly the way you did,
> > > > only replacing the word 'documentText' with 'content', which is used
> > > > by the techproducts sample. I reindexed with:
> > > >
> > > > curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
> > > >
> > > > with the same result... no highlight in the response, as below:
> > > >
> > > > "highlighting": { "pdf1": {} }
> > > >
> > > > =(
> > > >
> > > > Really... I do not know what to do...
> > > >
> > > > Thanks for your time. If you have any more suggestions about where I
> > > > could be missing something... please let me know.
> > > >
> > > > Best regards,
> > > >
> > > > *Evert*
> > > >
> > > > 2015-12-16 15:30 GMT-02:00 Teague James <teag...@insystechinc.com>:
> > > >
> > > >> Hi Evert,
> > > >>
> > > >> I recently needed help with phrase highlighting and was pointed to
> > > >> the FastVectorHighlighter which worked out great. I just made a
> > > >> change to the configuration to add generateWordParts="0" and
> > > >> generateNumberParts="0" so that searches for things like "1a" would
> > > >> get highlighted correctly. You may or may not need that feature. You
> > > >> can always remove them or change the value to "1" to switch them on
> > > >> explicitly. Anyway, hope this helps!
> > > >>
> > > >> solrconfig.xml (partial snip)
> > > >> <requestHandler name="/select" class="solr.SearchHandler">
> > > >>   <lst name="defaults">
> > > >>     <str name="wt">xml</str>
> > > >>     <str name="echoParams">explicit</str>
> > > >>     <int name="rows">10</int>
> > > >>     <str name="df">documentText</str>
> > > >>     <str name="hl">on</str>
> > > >>     <str name="hl.fl">text</str>
> > > >>     <str name="hl.useFastVectorHighlighter">true</str>
> > > >>     <str name="hl.snippets">100</str>
> > > >>     <str name="hl.tag.pre"><![CDATA[<b>]]></str>
> > > >>     <str name="hl.tag.post"><![CDATA[</b>]]></str>
> > > >>   </lst>
> > > >> </requestHandler>
> > > >>
> > > >> schema.xml (partial snip)
> > > >> <field name="id" type="string" indexed="true" stored="true"
> > > >>   required="true" multiValued="false" />
> > > >> <field name="documentText" type="text_general" indexed="true"
> > > >>   multiValued="true" termVectors="true" termOffsets="true"
> > > >>   termPositions="true" />
> > > >>
> > > >> <fieldType name="text_general" class="solr.TextField"
> > > >>   positionIncrementGap="100">
> > > >>   <analyzer type="index">
> > > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >>       words="stopwords.txt" />
> > > >>     <filter class="solr.WordDelimiterFilterFactory"
> > > >>       catenateAll="1" preserveOriginal="1" generateNumberParts="0"
> > > >>       generateWordParts="0" />
> > > >>     <filter class="solr.SynonymFilterFactory"
> > > >>       synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
> > > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > > >>     <filter class="solr.PorterStemFilterFactory"/>
> > > >>     <filter class="solr.ApostropheFilterFactory"/>
> > > >>   </analyzer>
> > > >>   <analyzer type="query">
> > > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >>     <filter class="solr.WordDelimiterFilterFactory"
> > > >>       catenateAll="1" preserveOriginal="1" generateWordParts="0" />
> > > >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > > >>       words="stopwords.txt" />
> > > >>     <filter class="solr.LowerCaseFilterFactory"/>
> > > >>     <filter class="solr.ApostropheFilterFactory"/>
> > > >>   </analyzer>
> > > >> </fieldType>
> > > >>
> > > >> -Teague
> > > >>
> > > >> From: Evert R. [mailto:evert.ra...@gmail.com]
> > > >> Sent: Tuesday, December 15, 2015 6:25 AM
> > > >> To: solr-user@lucene.apache.org
> > > >> Subject: Solr Basic Configuration - Highlight - Begginer
> > > >>
> > > >> Hi there!
> > > >>
> > > >> It's my first installation, not sure if this is the right channel...
> > > >>
> > > >> Here are my steps:
> > > >>
> > > >> 1. Set up a basic install of solr 5.4.0
> > > >>
> > > >> 2. Create a new core through command line (bin/solr create -c test)
> > > >>
> > > >> 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)
> > > >>
> > > >> 4. Query over the browser and it brings the correct search, but it
> > > >> does not show the part of the text I am querying, the highlight.
> > > >>
> > > >> I have already flagged the 'hl' option. But still it does not
> > > >> work...
> > > >>
> > > >> Example: I am looking for the word 'peace' in my pdf file (book). I
> > > >> have 4 matches for this word. It shows me the book name (pdf file)
> > > >> but does not bring the part of the text that has the word 'peace'
> > > >> in it.
> > > >>
> > > >> I am probably missing some configuration in schema.xml, which is
> > > >> missing from my folder.... /solr/server/solr/test/conf/
> > > >>
> > > >> Or even the solrconfig.xml...
> > > >>
> > > >> I have read a bunch of things about highlighting, checked these
> > > >> files, and copied the standard schema.xml to my core/conf folder,
> > > >> but still it does not bring the highlight.
> > > >>
> > > >> Attached a copy of my solrconfig.xml file.
> > > >>
> > > >> I am very sorry for this, probably, dumb and too basic question...
> > > >> It's the first time I have seen Solr live.
> > > >>
> > > >> Any help will be appreciated.
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Evert Ramos
> > > >>
> > > >> mailto:evert.ra...@gmail.com
>
> --
> Kind regards,
>
> -Teague James
> *Senior Web Applications Developer*
> Insystech Inc.
> teag...@insystechinc.com
> (703) 508-0008 (Cell)
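
For anyone retracing the checks suggested in this thread, here is a minimal Python sketch of the two requests Erick describes: one with debug=query to see which field an unfielded search really hits, and one that searches and highlights the same field. The host, core name, field name and document id are the ones quoted in the thread. The function names are hypothetical helpers made up for this sketch, not part of Solr or any client library, and it only builds URLs and inspects an already-parsed JSON response, so it does not need a running Solr.

```python
from urllib.parse import urlencode

# Select handler URL as quoted in the thread (techproducts example core).
SOLR_SELECT = "http://localhost:8983/solr/techproducts/select"


def debug_query_url(term, doc_id=None):
    """Build an unfielded query with debug=query, to see which field is searched."""
    params = [("q", term), ("wt", "json"), ("indent", "true"), ("debug", "query")]
    if doc_id is not None:
        params.append(("fq", f"id:{doc_id}"))
    return f"{SOLR_SELECT}?{urlencode(params)}"


def highlight_url(term, field, doc_id=None, snippets=2):
    """Build a query that searches `field` and highlights that same field."""
    params = [
        ("q", f"{field}:{term}"),
        ("wt", "json"),
        ("indent", "true"),
        ("hl", "true"),
        ("hl.fl", field),
        ("hl.snippets", str(snippets)),
        ("hl.simple.pre", "<em>"),
        ("hl.simple.post", "</em>"),
    ]
    if doc_id is not None:
        params.append(("fq", f"id:{doc_id}"))
    return f"{SOLR_SELECT}?{urlencode(params)}"


def has_highlights(response, doc_id):
    """Return True if the parsed JSON response holds any snippet for doc_id."""
    per_doc = response.get("highlighting", {}).get(doc_id, {})
    return any(snippets for snippets in per_doc.values())


if __name__ == "__main__":
    print(debug_query_url("nietava", "pdf1"))
    print(highlight_url("nietava", "text", "pdf1"))
    # The empty result reported earlier in the thread:
    print(has_highlights({"highlighting": {"pdf1": {}}}, "pdf1"))
```

Paste the printed URLs into a browser or curl. If the first one returns the document but the second returns an empty "highlighting" entry, the stored="true" requirement and the analysis chain of that field are the next things to check, as discussed above.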