Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

ZiYuan Mon, 19 Jun 2017 08:44:31 -0700

Dear Erick and Timothy,

Yes, I will parse from the client for all the benefits. I am just trying to
figure out what is going on by indexing one or two PDF files first. Thank
you both.


Best regards,
Ziyuan
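
For anyone repeating the one-or-two-file test, here is a minimal sketch of
the request URL for the /update/extract handler, with the fmap.content
mapping discussed in the quoted messages below passed per request instead of
being set in the handler defaults. The host, port, and core name "pdfs" are
assumptions for illustration only; they do not come from this thread.

```python
from urllib.parse import urlencode

# Assumption: a local Solr with a core named "pdfs" (hypothetical name).
SOLR_BASE = "http://localhost:8983/solr/pdfs"


def extract_url(doc_id, content_field="content", commit=True):
    """Build a /update/extract URL that maps Tika's body text to content_field."""
    params = {
        "literal.id": doc_id,            # value for the unique key field
        "fmap.content": content_field,   # route the extracted text to this field
        "commit": "true" if commit else "false",
    }
    return SOLR_BASE + "/update/extract?" + urlencode(params)


url = extract_url("ESLII_print10.pdf")
print(url)
```

The PDF itself would then be sent as the body of a POST to that URL. Doing
the Tika parsing in a separate client program, as recommended below, avoids
this handler altogether.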

On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Hope that there is no side effect of not mapping the PDF
>
> Well, yes it will have that side effect. You can cure that with a
> copyField directive from content to _text_.
>
> But do really consider running this as a SolrJ program on the client.
> Tim knows in far more painful detail than I do what kinds of problems
> there are when parsing all the different formats so I'd _really_
> follow his advice.
>
> Tika pretty much has an impossible job. "Here, try to parse all these
> different formats, implemented by different vendors with different
> versions that more or less follow a spec, which really isn't a spec in
> many cases, just recommendations, using packages that may or may not be
> actively maintained. And by the way, we'll try to handle that 1G
> document that someone sends us, but don't blame us if we hit an
> OOM.....". When Tika is run on the same box as Solr any problems in
> that entire chain can adversely affect your search.
>
> Not to mention that Tika has to do some heavy lifting, using CPU
> cycles that are unavailable for Solr.
>
> Extracting Request Handler is a fine way to get started, but for
> production seriously consider a separate client.
>
> Best,
> Erick
>
> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
> > Hi Erick,
> >
> > Now it is clear. I have to update the request handler of /update/extract/
> > from
> > "defaults":{"fmap.content":"_text_"}
> > to
> > "defaults":{"fmap.content":"content"}
> > to fill the field.
> >
> > Hope that there is no side effect of not mapping the PDF content to _text_.
> > Thank you for the hint.
> >
> > Best regards,
> > Ziyuan
> >
> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
> > wrote:
> >
> >> Ziyuan -
> >>
> >> You may be interested in the example/files that ships with Solr too. It’s
> >> got schema and config and even UI for file indexing and searching. Check
> >> out the README.txt under example/files in your Solr install.
> >>
> >>         Erik
> >>
> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
> >> >
> >> > Hi Erick,
> >> >
> >> > Thanks very much for the explanations! Clarification for question 2: more
> >> > specifically I cannot see the field content in the returned JSON, with the
> >> > same definitions as in the post
> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
> >> >
> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
> >> > <field name="text" type="text_general" multiValued="true" indexed="true" stored="false"/>
> >> > <copyField source="content" dest="text"/>
> >> >
> >> > Is it so that Tika does not fill these two fields automatically and I have
> >> > to write some client code to fill them?
> >> >
> >> > Best regards,
> >> > Ziyuan
> >> >
> >> >
> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> 1> Yes, you can use your single definition. The author identifies the
> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> >> copyField directive copying (perhaps) many different fields to the
> >> >> "text" field. That permits simple searches against a single field
> >> >> rather than, say, using edismax to search across multiple separate
> >> >> fields.
> >> >>
> >> >> 2> The link you referenced is for Data Import Handler, which is much
> >> >> different than just posting files to Solr. See
> >> >> ExtractingRequestHandler:
> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> >> >> There are ways to map meta-data fields from the doc into specific
> >> >> fields matching your schema. Be a little careful here. There is no
> >> >> standard across different types of docs as to what meta-data field is
> >> >> included. PDF might have a "last_edited" field. Word might have a
> >> >> "last_modified" field where the two mean the same thing. Here's a link
> >> >> to a SolrJ program that'll dump all the fields:
> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> >> >> hack out the DB bits.
> >> >>
> >> >> BTW, once you get more familiar with processing, I strongly recommend
> >> >> you do the document processing on the client; the reasons are outlined
> >> >> in that article.
> >> >>
> >> >> bq: even I define the fields as he said I cannot see them in the
> >> >> search results as keys in JSON
> >> >> are the fields set as stored="true"? They must be to be returned in
> >> >> requests (skipping the docValues discussion here).
> >> >>
> >> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> >> Because it has stored=false, you can only search it, you cannot
> >> >> highlight or view. Fields you highlight must have stored=true BTW.
> >> >>
> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
> >> >> things, most particularly whether that text is ever actually in a
> >> >> field in your index. There's no guarantee that the name of the file
> >> >> is indexed in a searchable/highlightable way.
> >> >>
> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
> >> >> as
> >> >> id:Trevor _text_:Hastie
> >> >> _text_ is the default field; look for a "df" parameter in your request
> >> >> handler in solrconfig.xml (usually "/select" or "/query").
> >> >>
> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
> >> >>> files. The indexing part works out of the box by using bin/post. I can see
> >> >>> search results in the admin UI given some queries, though without the
> >> >>> matched texts and the context.
> >> >>>
> >> >>> Now I am reading this post
> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >> >>> for the highlighting part. It is for an older version of Solr when managed
> >> >>> schema was not available. Before I fully understand what it is doing I have
> >> >>> some questions:
> >> >>>
> >> >>> 1. He defined two fields:
> >> >>>
> >> >>> <field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
> >> >>> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
> >> >>>
> >> >>> But why are there two fields needed? Can I define a field
> >> >>>
> >> >>> <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
> >> >>>
> >> >>> to capture the full text?
> >> >>>
> >> >>> 2. How are the fields filled? I don't see relevant information in
> >> >>> TikaEntityProcessor's documentation
> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
> >> >>> The current text extractor should already be Tika (I can see
> >> >>>
> >> >>> "x_parsed_by":
> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
> >> >>>
> >> >>> in the returned JSON of some query). But even if I define the fields as he
> >> >>> said I cannot see them in the search results as keys in JSON.
> >> >>>
> >> >>> 3. The _text_ field seems to be a concatenation of other fields; does it
> >> >>> contain the full text? It does not seem to be accessible by default, though.
> >> >>>
> >> >>> To be brief, using The Elements of Statistical Learning
> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
> >> >>> as an example, how do I highlight the relevant text for the query "SVM"? And
> >> >>> if I change the file name to "The Elements of Statistical Learning -
> >> >>> Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the
> >> >>> query "id:Trevor Hastie"?
> >> >>>
> >> >>> Thank you.
> >> >>>
> >> >>> Best regards,
> >> >>> Ziyuan
> >> >>
> >>
> >>
>
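
As a rough illustration of the points made in the thread about query parsing
and highlighting, the sketch below builds a /select URL that keeps a
two-word query inside one field by quoting it as a phrase, and asks for
highlighted snippets from that same field. It assumes the single "content"
field from question 1, defined with indexed="true" and stored="true", so it
can be both searched and highlighted. The host, port, core name "pdfs", and
the snippet settings are assumptions for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a local Solr with a core named "pdfs" (hypothetical name).
SOLR_BASE = "http://localhost:8983/solr/pdfs"


def highlight_query_url(text, field="content", snippets=3, fragsize=150):
    """Build a /select URL that phrase-searches `field` and highlights it."""
    # Quote the text as a phrase so every word is matched against `field`
    # rather than the second word falling through to the default field.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{field}:"{escaped}"',
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field to build snippets from (must be stored)
        "hl.snippets": snippets,  # maximum snippets per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "wt": "json",
    }
    return SOLR_BASE + "/select?" + urlencode(params)


url = highlight_query_url("Trevor Hastie")
# Decode the query string again to show what Solr would receive as q.
print(parse_qs(urlsplit(url).query)["q"][0])
```

In the JSON response, the snippets appear in a separate "highlighting"
section keyed by document id, not inside the documents themselves.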
