Re: Highlight brings the content from the first pages of pdf

Evert R. Mon, 15 Feb 2016 02:08:27 -0800

Hello Mark,

Thanks for your reply.


All text is indexed (1 pdf file). It works now.
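
For anyone landing on this thread later, here is a minimal sketch of the kind of request discussed in the quoted messages below. It only builds the query URL. The host, port, and core path (localhost:8983, techproducts) and the value 1000000 are assumptions for illustration, and build_highlight_url is a hypothetical helper, not part of Solr. The parameter names themselves (hl, hl.fl, hl.maxAnalyzedChars, hl.snippets, hl.fragsize) are the ones named in the thread.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, term, field="content",
                        max_analyzed_chars=1000000, snippets=3, fragsize=None):
    """Build a Solr select URL with standard-highlighter parameters.

    base_url and the default values are assumptions, not settings from the thread.
    """
    params = {
        "q": term,
        "fl": "id",
        "wt": "json",
        "hl": "true",
        "hl.fl": field,
        # Raise the limit so matches deep inside a long PDF are analyzed.
        "hl.maxAnalyzedChars": str(max_analyzed_chars),
        # Ask for up to this many highlighted fragments per field.
        "hl.snippets": str(snippets),
    }
    if fragsize is not None:
        # Only sent when given explicitly; 0 was suggested in the thread
        # to return the whole field as a single fragment.
        params["hl.fragsize"] = str(fragsize)
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_highlight_url(
        "http://localhost:8983/solr/techproducts/select", "nietava"))
```

Passing the parameters on the request (or in the request handler defaults) matches the advice quoted below, which says the fragmenter definitions are not the right place for hl.maxAnalyzedChars.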

Best regards,


*--Evert*

2016-02-14 23:47 GMT-02:00 Mark Ehle <marke...@gmail.com>:

> Is all the text being indexed? Check to make sure that the data you are
> looking for is actually in the index. Is there a setting in Tika that
> limits how much is indexed? I seem to remember confronting this problem
> myself once, and the data that I wanted just wasn't in the index because it
> was never put there in the first place. Something about setMaxStringLength
> or something.
>
> On Sun, Feb 14, 2016 at 8:28 PM, Binoy Dalal <binoydala...@gmail.com> wrote:
>
> > What you've done so far will highlight every instance of "nietava" found in
> > the field and return it, i.e., your entire field will be returned with all the
> > "nietava"s in <em> tags.
> > If you do not want the entire field, only the portions of your field containing
> > the matched terms, then set the hl.snippets parameter to the number of snippets
> > you want, in this particular case 3, along with the hl.fragsize parameter
> > set to the same number as your hl.maxAnalyzedChars (or a really large
> > number).
> >
> > I suggest you go through the wiki documentation for highlighting once (
> > https://wiki.apache.org/solr/HighlightingParameters). It should answer all
> > of the questions you might have regarding the use of the standard
> > highlighter.
> >
> > As an additional note, I also suggest that you look into the
> > PostingsHighlighter (
> > https://cwiki.apache.org/confluence/display/solr/Postings+Highlighter),
> > since you seem to be running highlighting on pretty big fields and the
> > postings highlighter is much more efficient at highlighting huge fields
> > than the standard highlighter.
> >
> > On Mon, Feb 15, 2016 at 4:15 AM Evert R. <evert.ra...@gmail.com> wrote:
> >
> > > Binoy,
> > >
> > > You are the man! =)
> > >
> > > Thank you very much!
> > >
> > > Would you by chance know how I could get the second highlight of the same
> > > word in the same file?
> > >
> > > For example, file_1.pdf contains the word "nietava" three times. How can I
> > > get the highlights for all three occurrences?
> > >
> > > I am pretty new around here; should I open another thread for this?
> > >
> > > Thanks again!
> > >
> > >
> > > *--Evert*
> > >
> > > 2016-02-14 16:40 GMT-02:00 Binoy Dalal <binoydala...@gmail.com>:
> > >
> > > > Are you sure you've typed in the parameters correctly?
> > > > In your response it says flagsize instead of fragsize and maxanalzyedchars
> > > > instead of maxanalyzedchars.
> > > >
> > > > Ohh wait, I see that I made the analyzed typo. Awfully sorry for that,
> > > > I'm using my phone to send the mail out.
> > > >
> > > > On Sun, 14 Feb 2016, 23:53 Evert R. <evert.ra...@gmail.com> wrote:
> > > >
> > > > > Hi Binoy,
> > > > >
> > > > > thanks!
> > > > >
> > > > > Still not working, check the output:
> > > > >
> > > > > {
> > > > >   "responseHeader":{
> > > > >     "status":0,
> > > > >     "QTime":58,
> > > > >     "params":{
> > > > >       "q":"nietava",
> > > > >       "hl":"true",
> > > > >       "hl.simple.post":"</em>",
> > > > >       "indent":"true",
> > > > >       "fl":"id",
> > > > >       "hl.flagsize":"0",
> > > > >       "hl.fl":"content",
> > > > >       "hl.maxAnalzyedChars":"208400",
> > > > >       "wt":"json",
> > > > >       "hl.simple.pre":"<em>"}},
> > > > >   "response":{"numFound":1,"start":0,"docs":[
> > > > >       {
> > > > >         "id":"/home/solr/dados/teste/Emmanuel.pdf"}]
> > > > >   },
> > > > >   "highlighting":{
> > > > >     "/home/solr/dados/teste/Emmanuel.pdf":{}}}
> > > > >
> > > > >
> > > > >
> > > > > *--Evert*
> > > > >
> > > > > 2016-02-14 14:31 GMT-02:00 Binoy Dalal <binoydala...@gmail.com>:
> > > > >
> > > > > > Don't add this parameter to the searchComponent definition, because the
> > > > > > components where you've added it, GapFragmenter and RegexFragmenter,
> > > > > > simply don't use it.
> > > > > > Instead, add it to your request handler (/select etc.) if you've
> > > > > > configured highlighting in the handler, or append it to your query:
> > > > > > *&hl.maxAnalzyedChars=<some_really_big_number>*.
> > > > > > Additionally, set the *hl.fragsize parameter to 0* in a similar fashion
> > > > > > if your text is larger than 51200 chars, which it most likely is.
> > > > > >
> > > > > >
> > > > > > On Sun, Feb 14, 2016 at 9:02 PM Evert R. <evert.ra...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Binoy,
> > > > > > >
> > > > > > > I could not find this option in my solrconfig.xml file.
> > > > > > >
> > > > > > > I tried to add this setting and nothing changed...
> > > > > > >
> > > > > > > Here is the code; I might have misplaced it:
> > > > > > >
> > > > > > > <code>
> > > > > > > <searchComponent class="solr.HighlightComponent" name="highlight">
> > > > > > >     <highlighting>
> > > > > > >       <!-- Configure the standard fragmenter -->
> > > > > > >       <!-- This could most likely be commented out in the "default" case -->
> > > > > > >       <fragmenter name="gap"
> > > > > > >                   default="true"
> > > > > > >                   class="solr.highlight.GapFragmenter">
> > > > > > >         <lst name="defaults">
> > > > > > >           <int name="hl.fragsize">400</int>
> > > > > > >           <int name="hl.maxAnalyzedChars">409600</int>
> > > > > > >         </lst>
> > > > > > >       </fragmenter>
> > > > > > >
> > > > > > >       <!-- A regular-expression-based fragmenter
> > > > > > >            (for sentence extraction)
> > > > > > >         -->
> > > > > > >       <fragmenter name="regex"
> > > > > > >                   class="solr.highlight.RegexFragmenter">
> > > > > > >         <lst name="defaults">
> > > > > > >           <!-- slightly smaller fragsizes work better because of slop -->
> > > > > > >           <int name="hl.fragsize">200</int>
> > > > > > >           <int name="hl.maxAnalyzedChars">409600</int>
> > > > > > >           <!-- allow 50% slop on fragment sizes -->
> > > > > > >           <float name="hl.regex.slop">0.5</float>
> > > > > > >           <!-- a basic sentence pattern -->
> > > > > > >           <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
> > > > > > >         </lst>
> > > > > > >       </fragmenter>
> > > > > > >
> > > > > > > </code>
> > > > > > >
> > > > > > > thanks!
> > > > > > >
> > > > > > >
> > > > > > > *--Evert*
> > > > > > >
> > > > > > > 2016-02-14 12:14 GMT-02:00 Binoy Dalal <binoydala...@gmail.com>:
> > > > > > >
> > > > > > > > From the solr wiki:
> > > > > > > > hl.maxAnalyzedChars
> > > > > > > >
> > > > > > > > How many characters into a document to look for suitable
> > > > > > > > snippets (Solr 1.3). This parameter makes sense for the original
> > > > > > > > Highlighter only.
> > > > > > > >
> > > > > > > > The default value is "51200".
> > > > > > > >
> > > > > > > > You can assign a large value to this parameter and use hl.fragsize=0
> > > > > > > > to return highlighting in large fields that have size greater than
> > > > > > > > 51200 characters.
> > > > > > > >
> > > > > > > > I think this might be your hiccup.
> > > > > > > >
> > > > > > > > On Sun, 14 Feb 2016, 17:11 Evert R. <evert.ra...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Paul,
> > > > > > > > >
> > > > > > > > > Sorry for my late reply.
> > > > > > > > >
> > > > > > > > > All the content is inside the docs. The search returns the docs and
> > > > > > > > > the PDF file that has the search word in it. But the highlight is
> > > > > > > > > not shown if the search word appears after the first few pages.
> > > > > > > > >
> > > > > > > > > Evert
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *--Evert*
> > > > > > > > >
> > > > > > > > > 2016-02-14 8:36 GMT-02:00 Paul Libbrecht <p...@hoplahup.net>:
> > > > > > > > >
> > > > > > > > > > This looks like the stored content is shortened. Could that be?
> > > > > > > > > > Can you see that inside the docs?
> > > > > > > > > >
> > > > > > > > > > paul
> > > > > > > > > >
> > > > > > > > > > > Evert R. <mailto:evert.ra...@gmail.com>
> > > > > > > > > > > 14 February 2016 at 11:26
> > > > > > > > > > > Hi There,
> > > > > > > > > > >
> > > > > > > > > > > I started the techproducts example, without any modification,
> > > > > > > > > > > and posted a PDF file. When searching with:
> > > > > > > > > > >
> > > > > > > > > > > q=text:search_word
> > > > > > > > > > > hl=true
> > > > > > > > > > > hl.fl=content
> > > > > > > > > > >
> > > > > > > > > > > It shows the highlight accordingly! =)
> > > > > > > > > > >
> > > > > > > > > > > BUT... *if the "search_word" is after the first pages* in my
> > > > > > > > > > > PDF file, such as page 15...
> > > > > > > > > > >
> > > > > > > > > > > It simply *does not show* *the HIGHLIGHT*...
> > > > > > > > > > >
> > > > > > > > > > > Has anyone faced this situation before?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *--Evert*
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > > Binoy Dalal
> > > > > > > >
> > > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Binoy Dalal
> > > > > >
> > > > >
> > > > --
> > > > Regards,
> > > > Binoy Dalal
> > > >
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
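
The empty "highlighting" section in the response quoted above came back with status 0 even though two parameter names were misspelled (hl.flagsize and hl.maxAnalzyedChars), so nothing in the response itself pointed at the typos. As a rough sketch, a client-side check like the one below could flag such names before a request is sent. The list of known names is a small, deliberately incomplete subset of the standard highlighter parameters, and suspicious_hl_params is a hypothetical helper, not a Solr API.

```python
import difflib

# A small subset of standard-highlighter parameter names; extend as needed.
KNOWN_HL_PARAMS = [
    "hl",
    "hl.fl",
    "hl.snippets",
    "hl.fragsize",
    "hl.maxAnalyzedChars",
    "hl.simple.pre",
    "hl.simple.post",
]


def suspicious_hl_params(params):
    """Map each unrecognized hl* parameter name to its closest known name.

    Names with no close match map to None. Matching is case-sensitive,
    like the exact-name check it is meant to guard.
    """
    flagged = {}
    for name in params:
        if not name.startswith("hl") or name in KNOWN_HL_PARAMS:
            continue
        close = difflib.get_close_matches(name, KNOWN_HL_PARAMS, n=1, cutoff=0.8)
        flagged[name] = close[0] if close else None
    return flagged


if __name__ == "__main__":
    # Parameter names taken from the response quoted in the thread.
    request_params = {
        "q": "nietava",
        "hl": "true",
        "hl.fl": "content",
        "hl.flagsize": "0",
        "hl.maxAnalzyedChars": "208400",
    }
    for bad, suggestion in suspicious_hl_params(request_params).items():
        print(bad, "-> did you mean", suggestion)
```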
