RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Phil Scadden Tue, 20 Jun 2017 14:14:56 -0700

SolrJ talks to the server over HTTP. However, the big advantage of doing your
indexing on a different machine is that the heavy lifting Tika does in
extracting text from documents, finding metadata, etc. is not happening on the
server. If the indexer crashes, it doesn’t affect Solr either.
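
As a rough sketch of that split (not a definitive implementation): the
indexer machine extracts the text itself and only sends the finished document
to Solr over HTTP. The host, the collection name "pdfs", the field name
"content" and both function names below are assumptions for illustration; the
Tika extraction step is left out and represented by a plain string.

```python
import json
import urllib.request

# Assumed Solr host and collection name; adjust to your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/pdfs/update?commit=true"


def build_update_request(doc_id, text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for one document.

    `text` is whatever the extraction step produced on the indexer
    machine; Solr only receives the finished field values.
    """
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_document(doc_id, text):
    """Send the document to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_update_request(doc_id, text)) as resp:
        return resp.status
```

If the extraction step blows up on a bad PDF, only the indexer process is
affected; Solr never sees the file, just the JSON for documents that parsed.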


-----Original Message-----
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, 20 June 2017 11:29 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting 
matched text with context

Dear Erick and Timothy,

I also took a look at the Python clients (say, SolrClient and pysolr) because
Python is my main programming language. My impression is that (1) they send
HTTP requests to the server according to the server APIs, and (2) they are not
official and thus possibly not up to date. Does SolrJ talk to the server via
HTTP or in some other, more native way? Is the main benefit of SolrJ over other
clients that it ships officially with Solr? Thank you.

Best regards,
Ziyuan
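
On the point above that the Python clients simply send HTTP requests: a
minimal sketch, using only the standard library, of the kind of request such a
client would build for a search with highlighting. The host, the collection
name "pdfs", the function name and the snippet settings are assumptions for
illustration; "content" is assumed to be a field that is both indexed and
stored, since fields you highlight must have stored=true.

```python
from urllib.parse import urlencode


def build_highlight_url(query, field="content",
                        base="http://localhost:8983/solr/pdfs/select"):
    """Return a /select URL that searches `field` and asks for highlights.

    Multi-word input is wrapped in double quotes so the whole phrase is
    searched in `field` instead of the trailing words falling through to
    the default field. Embedded quotes are not escaped in this sketch.
    """
    term = f'"{query}"' if " " in query else query
    params = {
        "q": f"{field}:{term}",
        "hl": "true",        # turn highlighting on
        "hl.fl": field,      # field to take snippets from
        "hl.snippets": 3,    # arbitrary choice: up to 3 snippets per doc
        "hl.fragsize": 200,  # arbitrary choice: about 200 chars of context
        "wt": "json",
    }
    return base + "?" + urlencode(params)
```

Fetching build_highlight_url("SVM") with any HTTP client should return the
matching documents plus a separate "highlighting" section keyed by document
id, provided the field is populated and stored.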

On Jun 19, 2017 18:43, "ZiYuan" <ziyu...@gmail.com> wrote:

> Dear Erick and Timothy,
>
> yes I will parse from the client for all the benefits. I am just
> trying to figure out what is going on by indexing one or two PDF files
> first. Thank you both.
>
> Best regards,
> Ziyuan
>
> On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson
> <erickerick...@gmail.com>
> wrote:
>
>> bq: Hope that there is no side effect of not mapping the PDF
>>
>> Well, yes it will have that side effect. You can cure that with a
>> copyField directive from content to _text_.
>>
>> But do really consider running this as a SolrJ program on the client.
>> Tim knows in far more painful detail than I do what kinds of problems
>> there are when parsing all the different formats so I'd _really_
>> follow his advice.
>>
>> Tika pretty much has an impossible job. "Here, try to parse all these
>> different formats, implemented by different vendors with different
>> versions that more or less follow a spec which really isn't a spec in
>> many cases just recommendations using packages that may or may not be
>> actively maintained. And by the way, we'll try to handle that 1G
>> document that someone sends us, but don't blame us if we hit an
>> OOM.....". When Tika is run on the same box as Solr any problems in
>> that entire chain can adversely affect your search.
>>
>> Not to mention that Tika has to do some heavy lifting, using CPU
>> cycles that are unavailable for Solr.
>>
>> Extracting Request Handler is a fine way to get started, but for
>> production seriously consider a separate client.
>>
>> Best,
>> Erick
>>
>> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
>> > Hi Erick,
>> >
>> > Now it is clear. I have to update the request handler of
>> > /update/extract/ from
>> > "defaults":{"fmap.content":"_text_"}
>> > to
>> > "defaults":{"fmap.content":"content"}
>> > to fill the field.
>> >
>> > Hope that there is no side effect of not mapping the PDF content to
>> > _text_. Thank you for the hint.
>> >
>> > Best regards,
>> > Ziyuan
>> >
>> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher
>> > <erik.hatc...@gmail.com>
>> > wrote:
>> >
>> >> Ziyuan -
>> >>
>> >> You may be interested in the example/files that ships with Solr too.
>> >> It’s got schema and config and even UI for file indexing and searching.
>> >> Check it out: README.txt under example/files in your Solr install.
>> >>
>> >>         Erik
>> >>
>> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>> >> >
>> >> > Hi Erick,
>> >> >
>> >> > thanks very much for the explanations! Clarification for question 2:
>> >> > more specifically, I cannot see the field content in the returned
>> >> > JSON, with the same definitions as in the post
>> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
>> >> >
>> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
>> >> > <field name="text" type="text_general" multiValued="true" indexed="true"
>> >> > stored="false"/>
>> >> > <copyField source="content" dest="text"/>
>> >> >
>> >> > Is it so that Tika does not fill these two fields automatically and I
>> >> > have to write some client code to fill them?
>> >> >
>> >> > Best regards,
>> >> > Ziyuan
>> >> >
>> >> >
>> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson
>> >> > <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> 1> Yes, you can use your single definition. The author identifies the
>> >> >> "text" field as a catch-all. Somewhere in the schema there'll
>> >> >> be a copyField directive copying (perhaps) many different
>> >> >> fields to the "text" field. That permits simple searches
>> >> >> against a single field rather than, say, using edismax to
>> >> >> search across multiple separate fields.
>> >> >>
>> >> >> 2> The link you referenced is for Data Import Handler, which is much
>> >> >> different than just posting files to Solr. See
>> >> >> ExtractingRequestHandler:
>> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>> >> >> There are ways to map meta-data fields from the doc into
>> >> >> specific fields matching your schema. Be a little careful here.
>> >> >> There is no standard across different types of docs as to what
>> >> >> meta-data field is included. PDF might have a "last_edited" field.
>> >> >> Word might have a "last_modified" field where the two mean the same
>> >> >> thing. Here's a link to a SolrJ program that'll dump all the fields:
>> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> >> >> hack out the DB bits.
>> >> >>
>> >> >> BTW, once you get more familiar with processing, I strongly recommend
>> >> >> you do the document processing on the client; the reasons are outlined
>> >> >> in that article.
>> >> >>
>> >> >> bq: even if I define the fields as he said I cannot see them in
>> >> >> the search results as keys in JSON
>> >> >>
>> >> >> Are the fields set as stored="true"? They must be to be returned
>> >> >> in requests (skipping the docValues discussion here).
>> >> >>
>> >> >> 3> Yes, the text field is a concatenation of all the other ones.
>> >> >> Because it has stored=false, you can only search it, you cannot
>> >> >> highlight or view. Fields you highlight must have stored=true BTW.
>> >> >>
>> >> >> Whether or not you can highlight "Trevor Hastie" depends on a
>> >> >> lot of things, most particularly whether that text is ever
>> >> >> actually in a field in your index. There's no guarantee that
>> >> >> the name of the file is indexed in a
>> >> >> searchable/highlightable way.
>> >> >>
>> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
>> >> >> parsed as
>> >> >> id:Trevor _text_:Hastie
>> >> >> _text_ is the default field; look for a "df" parameter in your request
>> >> >> handler in solrconfig.xml (usually "/select" or "/query").
>> >> >>
>> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
>> >> >>> Hi,
>> >> >>>
>> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
>> >> >>> files. The indexing part works out of the box by using bin/post. I can
>> >> >>> see search results in the admin UI given some queries, though without
>> >> >>> the matched texts and the context.
>> >> >>>
>> >> >>> Now I am reading this post
>> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>> >> >>> for the highlighting part. It is for an older version of Solr when
>> >> >>> managed schema was not available. Before fully understanding what it is
>> >> >>> doing I have some questions:
>> >> >>>
>> >> >>> 1. He defined two fields:
>> >> >>>
>> >> >>> <field name="content" type="text_general" indexed="false" stored="true"
>> >> >>> multiValued="false"/>
>> >> >>> <field name="text" type="text_general" indexed="true" stored="false"
>> >> >>> multiValued="true"/>
>> >> >>>
>> >> >>> But why are two fields needed? Can I define a field
>> >> >>>
>> >> >>> <field name="content" type="text_general" indexed="true" stored="true"
>> >> >>> multiValued="true"/>
>> >> >>>
>> >> >>> to capture the full text?
>> >> >>>
>> >> >>> 2. How are the fields filled? I don't see relevant information
>> >> >>> in TikaEntityProcessor's documentation
>> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>> >> >>> The current text extractor should already be Tika (I can see
>> >> >>>
>> >> >>> "x_parsed_by":
>> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>> >> >>>
>> >> >>> in the returned JSON of some query). But even if I define the fields as
>> >> >>> he said I cannot see them in the search results as keys in JSON.
>> >> >>>
>> >> >>> 3. The _text_ field seems a concatenation of other fields, does it
>> >> >>> contain the full text? Though it does not seem to be accessible by
>> >> >>> default.
>> >> >>>
>> >> >>> To be brief, using The Elements of Statistical Learning
>> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>> >> >>> as an example, how do I highlight the relevant text for the query
>> >> >>> "SVM"? And if I change the file name to "The Elements of Statistical
>> >> >>> Learning - Trevor Hastie.pdf" and post it, how do I highlight
>> >> >>> "Trevor Hastie" for the query "id:Trevor Hastie"?
>> >> >>>
>> >> >>> Thank you.
>> >> >>>
>> >> >>> Best regards,
>> >> >>> Ziyuan
>> >> >>
>> >>
>> >>
>>
>
>
