Re: Solr Basic Configuration - Highlight - Begginer

Evert R. Thu, 17 Dec 2015 03:01:12 -0800

Hello Teague,

Thanks for your reply and tip! I think Solr will give me a better result
than just using Tika to read my files and send them to a full-text index in
MySQL, which has the specific drawback of not highlighting the text snippets...


So I will keep trying to fit Solr to my needs. I am sure it works... I
am just missing something.

Thanks again, and I will keep at it.

When I find the solution I will post all files and configs here for future
reference.

Best regards,

*Evert*

2015-12-17 6:11 GMT-02:00 Teague James <teag...@insystechinc.com>:

> Erick's comments notwithstanding, there are some gaps in my understanding
> of your precise situation. Here are a few things that weren't necessarily
> obvious to me when I first tried Solr.
>
> Highlighting is the end result of a good hit. It is essentially formatting
> applied to your hit. It is possible to get a hit without a highlight if
> certain conditions exist.
>
> First, start by making sure you are indexing your target (a PDF file?)
> correctly. Assuming you are indexing PDFs, are you extracting metadata
> only or are you parsing the document with Tika? If you want hits on the
> contents of your PDF, then you have to parse it at index time and store
> that. That was why I suggested just running some queries through the
> interface and the URL to see what Solr actually captured from your indexed
> PDF before worrying about how it looks on the screen.
>
> Next, you should look carefully at the Analyzer's output. Notice the
> abbreviations to the left of the columns? Hover over those to see which
> filter factory each one is. When words are split into multiple columns at
> one of those points, it indicates that the filter factory broke apart the
> word while analyzing it. Do a search for the filter factories that you
> find and read up on them. In my case "1a" was being split into 4 by a word
> delimiter filter factory - "1a", "1", "a", "1a" - which caused highlighting
> to fail while still getting a hit. It also caused erroneous hits
> elsewhere. Adding some switches to the schema is all it took to correct
> that for me. However, every case is different based on your needs. That is
> why it is important to go through the analyzer and see if Solr's indexing
> and querying are doing what you expect.
>
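
For anyone reading along later, here is a toy Python sketch of why a token like "1a" can end up as several tokens. It is only an illustration and is not Solr's WordDelimiterFilterFactory implementation: the function names and the generate_parts flag are made up here, loosely standing in for the generateWordParts / generateNumberParts switches mentioned further down the thread.

```python
import re


def split_on_letter_digit_boundaries(token):
    """Split a token into runs of letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def toy_word_delimiter(token, generate_parts=True, preserve_original=True):
    """Toy stand-in for a word delimiter filter; NOT Solr's implementation."""
    parts = split_on_letter_digit_boundaries(token)
    emitted = []
    if preserve_original:
        emitted.append(token)      # keep the token exactly as typed
    if generate_parts and len(parts) > 1:
        emitted.extend(parts)      # also emit each letter/digit run
    return emitted


print(toy_word_delimiter("1a"))                        # parts switched on
print(toy_word_delimiter("1a", generate_parts=False))  # parts switched off
```

With the parts switched on, one typed token becomes several indexed tokens. In Teague's case that kind of mismatch is what broke highlighting while a hit was still returned; the real filter has more options than this sketch.
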
> If that looks good and you've got solid hits all the way down, then it is
> time to start looking at your highlighter implementation in the index and
> query analyzers that you are using. My original issue of not being able to
> highlight phrases with one set of tags necessitated me switching to the
> fast vector highlighter - which had its own requirements for certain
> parameters to be set. Here again - going to the Solr docs and reading up on
> the various highlighters will be helpful in most cases.
>
> Solr has a very steep learning curve. I've been using it for several years
> and I still consider myself a noob. It can be a deep dive, but don't be
> discouraged. Keep at it. Cheers!
>
> -Teague
>
> On Wed, Dec 16, 2015 at 8:54 PM, Evert R. <evert.ra...@gmail.com> wrote:
>
> > Hi Erick and Teague,
> >
> >
> > I found that when using the field 'text' it shows the pdf file result
> > id:pdf1 in this case, like:
> >
> > http://localhost:8983/solr/techproducts/select?fq=id:pdf1&q=nietava
> >
> > but when highlighting on the text field... nothing comes up...
> >
> >
> >
> > http://localhost:8983/solr/techproducts/select?q=text:nietava&fq=id:pdf1&wt=json&indent=true&hl=true&hl.fl=text&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
> >
> > or even with the option
> >
> > f.text.hl.snippets=2 under the hl.fl field.
> >
> >
> > I tried as well with the standard configuration, did it all over,
> > reindexed a couple of times... and it still did not work.
> >
> > Also,
> >
> > The Analysis page shows the following information:
> >
> > ST
> >   text     raw_bytes               start  end  positionLength  type        position
> >   nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> > SF
> >   text     raw_bytes               start  end  positionLength  type        position
> >   nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> > LCF
> >   text     raw_bytes               start  end  positionLength  type        position
> >   nietava  [6e 69 65 74 61 76 61]  0      7    1               <ALPHANUM>  1
> >
> >
> > Alphanumeric, I think... so it's a 'string', right? Would that be a
> > problem? Should there be some other indication?
> >
> >
> > Thanks again!
> >
> >
> > *Evert*
> >
> > 2015-12-16 21:09 GMT-02:00 Erick Erickson <erickerick...@gmail.com>:
> >
> > > I think you're still missing the critical bit. Highlighting is
> > > completely separate from searching. In other words, you can search on
> > > one field and highlight another. What field is searched is governed by
> > > the "qf" parameter when using edismax and by the "df" parameter
> > > configured in your request handler in solrconfig.xml. These defaults
> > > are overridden when you do a "fielded search" like
> > >
> > > q=content:nietava
> > >
> > > So this: q=content:nietava&hl=true&hl.fl=content
> > > is searching the "content" field. The word you're looking for isn't in
> > > the content field so naturally no docs are returned. And no
> > > highlighting either.
> > >
> > > This: q=nietava&hl=true&hl.fl=content
> > >
> > > is searching somewhere else, thus getting the hit. We already know
> > > that "nietava" is not in the content field because the first search
> > > failed. You need to find out what field is being matched (probably
> > > something like "text") and then try highlighting on _that_ field. Try
> > > adding "debug=query" to the URL and look at the "parsed_query" section
> > > of the return and you'll see what field(s) is/are actually being
> > > searched against.
> > >
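
A minimal Python sketch of the two URLs discussed above, in case it helps to build them without typos. The helper name build_select_url and its arguments are made up for illustration; the parameters themselves (q, hl, hl.fl, debug=query, wt, indent) are the ones already used in this thread, and the host and core are just the techproducts example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, highlight_field=None, debug_query=False):
    """Build a /select URL; highlighting and query debugging are optional."""
    params = {"q": query, "wt": "json", "indent": "true"}
    if highlight_field is not None:
        params["hl"] = "true"
        params["hl.fl"] = highlight_field  # field to highlight, not the field searched
    if debug_query:
        params["debug"] = "query"          # adds the parsed query to the response
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Unfielded search: the searched field comes from df/qf, so ask Solr which one it used.
print(build_select_url("http://localhost:8983", "techproducts", "nietava",
                       highlight_field="text", debug_query=True))

# Fielded search: q names the field explicitly.
print(build_select_url("http://localhost:8983", "techproducts", "text:nietava",
                       highlight_field="text"))
```

The point of keeping query and highlight_field as separate arguments is the same one Erick makes: what is searched and what is highlighted are configured independently.
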
> > > NOTE: The field you highlight on _must_ have stored="true" in
> > > schema.xml.
> > >
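
One way to check this without reading the whole schema by eye is a small Python sketch like the one below. It assumes a classic schema.xml with <field> elements; the function name field_attributes and the sample schema text are made up for illustration. Note that it only reports attributes written on the field itself: if stored is not set there, the effective value comes from the field type or Solr's defaults, which this sketch does not resolve.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="content" type="text_general" indexed="true" stored="false"/>
  <field name="text" type="text_general" indexed="true"/>
</schema>
"""


def field_attributes(schema_xml, field_name):
    """Return the attributes declared on the named <field>, or None if absent."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return dict(field.attrib)
    return None


for name in ("content", "text"):
    attrs = field_attributes(SAMPLE_SCHEMA, name)
    # .get() returns None when the attribute is not written on the field
    print(name, "stored =", attrs.get("stored"))
```
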
> > > As to why "nietava" isn't being found in the content field, probably
> > > you have some kind of analysis chain configured for that field that
> > > isn't searching as you expect. See the admin/analysis page for some
> > > insight into why that would be. The most frequent reason is that the
> > > field is a "string" type which is not broken up into words. Another
> > > possibility is that your analysis chain is leaving in the quotes or
> > > something similar. As James says, looking at admin/analysis is a good
> > > way to figure this out.
> > >
> > > I still strongly recommend you go from the stock techproducts example
> > > and get familiar with how Solr (and highlighting) work before jumping
> > > in and changing things. There are a number of ways things can be
> > > mis-configured and trying to change several things at once is a fine
> > > way to go mad. The admin UI>>schema browser is another way you can see
> > > what kind of terms are _actually_ in your index in a particular field.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > >
> > >
> > > On Wed, Dec 16, 2015 at 12:26 PM, Teague James <teag...@insystechinc.com>
> > > wrote:
> > > > Sorry to hear that didn't work! Let me ask a couple of questions...
> > > >
> > > > Have you tried the analyzer inside of the Admin Interface? It has
> > > > helped me sort out a number of highlighting issues in the past. To
> > > > access it, go to your Admin interface, select your core, then select
> > > > Analysis from the list of options on the left. In the analyzer, enter
> > > > the term you are indexing (in other words, the term in the document
> > > > you are indexing that you expect to get a hit on) in the top left and
> > > > right input fields. Select the field that it is destined for (in your
> > > > case that would be 'content'), then hit analyze. It helps if you have
> > > > a big screen!
> > > >
> > > > This will show you the impact of the various filter factories that
> > > > you have engaged and their effect on whether or not a 'hit' is being
> > > > generated. Hits are identified by a very faint highlight. (PSST...
> > > > Developers... It would be really cool if the highlight color were
> > > > more visible or customizable... Thanks y'all) If it looks like you're
> > > > getting hits, but not getting highlighting, then open up a new tab
> > > > with the Admin's query interface. Same place on the left as the
> > > > analyzer. Replace the "*:*" with your search term (assuming you
> > > > already indexed your document) and if necessary you can put something
> > > > in the FQ like "id:123456" to target a specific record.
> > > >
> > > > Did you get a hit? If no, then it's not highlighting that's the
> > > > issue. If yes, then try dumping this in your address bar (using your
> > > > URL/IP, search term, and core name of course. The fq= is an example):
> > > >
> > > > http://[URL/IP]/solr/[CORE-NAME]/select?fq=id:123456&q="[SEARCH-TERM]"
> > > >
> > > > That will dump Solr's output to your browser where you can see
> > > > exactly what is getting hit.
> > > >
> > > > Hope that helps! Let me know how it goes. Good luck.
> > > >
> > > > -Teague
> > > >
> > > > -----Original Message-----
> > > > From: Evert R. [mailto:evert.ra...@gmail.com]
> > > > Sent: Wednesday, December 16, 2015 1:46 PM
> > > > To: solr-user <solr-user@lucene.apache.org>
> > > > Subject: Re: Solr Basic Configuration - Highlight - Begginer
> > > >
> > > > Hi Teague!
> > > >
> > > > I configured solrconfig.xml and schema.xml exactly the way you did,
> > > > only replacing 'documentText' with 'content', the field name used by
> > > > the techproducts sample. I reindexed with:
> > > >
> > > > curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=pdf1&commit=true' \
> > > >   -F "Emmanuel=@/home/solr/dados/teste/Emmanuel.pdf"
> > > >
> > > > with the same result... no highlighting in the response, as below:
> > > >
> > > > "highlighting": { "pdf1": {} }
> > > >
> > > > =(
> > > >
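
For checking this from a script rather than by eye, a minimal Python sketch follows. It assumes the JSON response shape shown above (a top-level "highlighting" object keyed by document id); the function name docs_without_snippets and the sample strings are made up for illustration.

```python
import json


def docs_without_snippets(response_text):
    """Return ids whose "highlighting" entry has no snippets in any field."""
    response = json.loads(response_text)
    highlighting = response.get("highlighting", {})
    return [
        doc_id
        for doc_id, fields in highlighting.items()
        if not any(fields.values())  # {} or only empty snippet lists
    ]


# The empty result from this thread, plus a made-up example that does have a snippet.
empty = '{"highlighting": {"pdf1": {}}}'
filled = '{"highlighting": {"pdf1": {"text": ["some <em>nietava</em> snippet"]}}}'

print(docs_without_snippets(empty))
print(docs_without_snippets(filled))
```

An id showing up in that list means the document matched but the highlighter produced nothing for the requested fields, which is the situation described in this message.
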
> > > > Really... I do not know what to do...
> > > >
> > > > Thanks for your time. If you have any more suggestions about where I
> > > > could be missing something... please let me know.
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > *Evert*
> > > >
> > > > 2015-12-16 15:30 GMT-02:00 Teague James <teag...@insystechinc.com>:
> > > >
> > > >> Hi Evert,
> > > >>
> > > >> I recently needed help with phrase highlighting and was pointed to
> > > >> the FastVectorHighlighter, which worked out great. I just made a
> > > >> change to the configuration to add generateWordParts="0" and
> > > >> generateNumberParts="0" so that searches for things like "1a" would
> > > >> get highlighted correctly. You may or may not need that feature. You
> > > >> can always remove them or change the value to "1" to switch them on
> > > >> explicitly. Anyway, hope this helps!
> > > >>
> > > >> solrconfig.xml (partial snip)
> > > >> <requestHandler name="/select" class="solr.SearchHandler">
> > > >>                 <lst name="defaults">
> > > >>                         <str name="wt">xml</str>
> > > >>                         <str name="echoParams">explicit</str>
> > > >>                         <int name="rows">10</int>
> > > >>                         <str name="df">documentText</str>
> > > >>                         <str name="hl">on</str>
> > > >>                         <str name="hl.fl">text</str>
> > > >>                         <str name="hl.useFastVectorHighlighter">true</str>
> > > >>                         <str name="hl.snippets">100</str>
> > > >>                         <str name="hl.tag.pre"><![CDATA[<b>]]></str>
> > > >>                         <str name="hl.tag.post"><![CDATA[</b>]]></str>
> > > >>                 </lst>
> > > >> </requestHandler>
> > > >>
> > > >> schema.xml (partial snip)
> > > >>    <field name="id" type="string" indexed="true" stored="true"
> > > >>           required="true" multiValued="false" />
> > > >>    <field name="documentText" type="text_general" indexed="true"
> > > >>           multiValued="true" termVectors="true" termOffsets="true"
> > > >>           termPositions="true" />
> > > >>
> > > >> <fieldType name="text_general" class="solr.TextField"
> > > >>            positionIncrementGap="100">
> > > >>         <analyzer type="index">
> > > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >>                 <filter class="solr.StopFilterFactory"
> > > >>                         ignoreCase="true" words="stopwords.txt" />
> > > >>                 <filter class="solr.WordDelimiterFilterFactory"
> > > >>                         catenateAll="1" preserveOriginal="1"
> > > >>                         generateNumberParts="0" generateWordParts="0" />
> > > >>                 <filter class="solr.SynonymFilterFactory"
> > > >>                         synonyms="index_synonyms.txt" ignoreCase="true"
> > > >>                         expand="true"/>
> > > >>                 <filter class="solr.LowerCaseFilterFactory"/>
> > > >>                 <filter class="solr.PorterStemFilterFactory"/>
> > > >>                 <filter class="solr.ApostropheFilterFactory"/>
> > > >>         </analyzer>
> > > >>         <analyzer type="query">
> > > >>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >>                 <filter class="solr.WordDelimiterFilterFactory"
> > > >>                         catenateAll="1" preserveOriginal="1"
> > > >>                         generateWordParts="0" />
> > > >>                 <filter class="solr.StopFilterFactory"
> > > >>                         ignoreCase="true" words="stopwords.txt" />
> > > >>                 <filter class="solr.LowerCaseFilterFactory"/>
> > > >>                 <filter class="solr.ApostropheFilterFactory"/>
> > > >>         </analyzer>
> > > >> </fieldType>
> > > >>
> > > >> -Teague
> > > >>
> > > >> From: Evert R. [mailto:evert.ra...@gmail.com]
> > > >> Sent: Tuesday, December 15, 2015 6:25 AM
> > > >> To: solr-user@lucene.apache.org
> > > >> Subject: Solr Basic Configuration - Highlight - Begginer
> > > >>
> > > >> Hi there!
> > > >>
> > > >> It's my first installation; not sure if this is the right channel...
> > > >>
> > > >> Here are my steps:
> > > >>
> > > >> 1. Set up a basic install of solr 5.4.0
> > > >>
> > > >> 2. Create a new core through command line (bin/solr create -c test)
> > > >>
> > > >> 3. Post 2 files: 1 .docx and 2 .pdf (bin/post -c test /docs/test/)
> > > >>
> > > >> 4. Query over the browser and it returns the correct results, but it
> > > >> does not show the part of the text I am querying, the highlight.
> > > >>
> > > >>   I have already flagged the 'hl' option. But still it does not
> > > >> work...
> > > >>
> > > >> Example: I am looking for the word 'peace' in my pdf file (book). I
> > > >> have 4 matches for this word; it shows me the book name (pdf file)
> > > >> but does not show which part of the text has the word 'peace' in it.
> > > >>
> > > >>
> > > >> I am probably missing some configuration in schema.xml, which is
> > > >> missing from my folder.... /solr/server/solr/test/conf/
> > > >>
> > > >> Or even the solrconfig.xml...
> > > >>
> > > >> I have read a bunch of things about highlighting, checked these files,
> > > >> and copied the standard schema.xml to my core/conf folder, but still
> > > >> it does not bring the highlight.
> > > >>
> > > >>
> > > >> Attached a copy of my solrconfig.xml file.
> > > >>
> > > >>
> > > >> I am very sorry for this probably dumb and too basic question...
> > > >> This is the first time I have seen Solr live.
> > > >>
> > > >>
> > > >> Any help will be appreciated.
> > > >>
> > > >>
> > > >>
> > > >> Best regards,
> > > >>
> > > >>
> > > >> Evert Ramos
> > > >>
> > > >> mailto:evert.ra...@gmail.com
> > > >>
> > > >>
> > > >>
> > > >
> > >
> >
>
>
>
> --
> Kind regards,
>
> -Teague James
> *Senior Web Applications Developer*
> Insystech Inc.
> teag...@insystechinc.com
> (703) 508-0008 (Cell)
>
