Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

Erik Hatcher Mon, 19 Jun 2017 03:58:38 -0700

Ziyuan -

You may also be interested in the example/files example that ships with Solr. It has
a schema, a config, and even a UI for file indexing and searching. Check out the
README.txt under example/files in your Solr install.


        Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
> 
> Hi Erick,
> 
> thanks very much for the explanations! Clarification for question 2: more
> specifically, I cannot see the field "content" in the returned JSON, with
> the same definitions as in the post
> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> :
> 
> <field name="content" type="text_general" indexed="false" stored="true"/>
> <field name="text" type="text_general" multiValued="true" indexed="true"
> stored="false"/>
> <copyField source="content" dest="text"/>
> 
> Is it the case that Tika does not fill these two fields automatically and I
> have to write some client code to fill them?
> 
> Best regards,
> Ziyuan
> 
> 
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a
>> copyField directive copying (perhaps) many different fields to the
>> "text" field. That permits simple searches against a single field
>> rather than, say, using edismax to search across multiple separate
>> fields.
>> 
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>> There are ways to map meta-data fields from the doc into specific
>> fields matching your schema. Be a little careful here. There is no
>> standard across different types of docs as to what meta-data field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing. Here's a link
>> to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> hack out the DB bits.
>> 
>> BTW, once you get more familiar with processing, I strongly recommend
>> you do the document processing on the client; the reasons are outlined
>> in that article.
>> 
>> bq: even I define the fields as he said I cannot see them in the
>> search results as keys in JSON
>> Are the fields set as stored="true"? They must be to be returned in
>> requests (skipping the docValues discussion here).
>> 
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot
>> highlight or view. Fields you highlight must have stored=true BTW.
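To make the stored=true requirement concrete, here is a minimal sketch of a highlighting request. It assumes the single "content" field from question 1 (indexed="true" stored="true"), a hypothetical core named "files" on localhost, and a made-up helper name, build_highlight_url. The hl, hl.fl, hl.snippets and hl.fragsize parameters are the standard Solr highlighting parameters; the values chosen here are only illustrative.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, query, field="content", snippets=3, fragsize=150):
    """Return a /select URL asking Solr for highlighted snippets of `field`.

    `base_url` and `field` are assumptions for illustration; the field
    must be stored="true" in the schema or no snippets can be returned.
    """
    params = {
        "q": query,
        "hl": "true",             # turn highlighting on
        "hl.fl": field,           # field(s) to build snippets from
        "hl.snippets": snippets,  # max snippets per document
        "hl.fragsize": fragsize,  # approximate snippet length in characters
        "wt": "json",
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core name "files"; adjust to your own install.
url = build_highlight_url("http://localhost:8983/solr/files", "content:SVM")
print(url)
```

If the field named in hl.fl is not stored, the response still lists matching documents, but there is no stored text to build snippets from.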
>> 
>> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>> things, most particularly whether that text is ever actually in a
>> field in your index. There's no guarantee that the name of the file
>> is indexed in a searchable/highlightable way.
>> 
>> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed as
>> id:Trevor _text_:Hastie
>> _text_ is the default field; look for a "df" parameter in your request
>> handler in solrconfig.xml (usually "/select" or "/query").
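One way to keep both words scoped to the same field is to send them as a quoted phrase. The sketch below uses a hypothetical helper, field_query, which is not part of Solr or SolrJ; it only builds the query string.

```python
def field_query(field, text):
    """Scope every word of `text` to `field` by emitting a quoted phrase.

    Hypothetical helper for illustration. Backslashes and double quotes
    inside `text` are escaped so the phrase stays well-formed.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


print(field_query("id", "Trevor Hastie"))  # id:"Trevor Hastie"
```

This only fixes the parsing. Whether the phrase matches still depends on how the field is analyzed: a string-typed id, for example, matches only on its whole value.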
>> 
>> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
>>> Hi,
>>> 
>>> I am new to Solr and I need to implement a full-text search of some PDF
>>> files. The indexing part works out of the box by using bin/post. I can see
>>> search results in the admin UI given some queries, though without the
>>> matched texts and the context.
>>> 
>>> Now I am reading this post
>>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> for the highlighting part. It is for an older version of Solr when managed
>>> schema was not available. Before fully understanding what it is doing, I
>>> have some questions:
>>> 
>>> 1. He defined two fields:
>>> 
>>> <field name="content" type="text_general" indexed="false" stored="true"
>>> multiValued="false"/>
>>> <field name="text" type="text_general" indexed="true" stored="false"
>>> multiValued="true"/>
>>> 
>>> But why are two fields needed? Can I define a field
>>> 
>>> <field name="content" type="text_general" indexed="true" stored="true"
>>> multiValued="true"/>
>>> 
>>> to capture the full text?
>>> 
>>> 2. How are the fields filled? I don't see relevant information in
>>> TikaEntityProcessor's documentation
>>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>>> The current text extractor should already be Tika (I can see
>>> 
>>> "x_parsed_by":
>>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>>> 
>>> in the returned JSON of some query). But even if I define the fields as he
>>> said, I cannot see them in the search results as keys in JSON.
>>> 
>>> 3. The _text_ field seems to be a concatenation of other fields. Does it
>>> contain the full text? It does not seem to be accessible by default, though.
>>> 
>>> To be brief, using The Elements of Statistical Learning
>>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>>> as an example, how do I highlight the relevant text for the query "SVM"?
>>> And if I change the file name to "The Elements of Statistical Learning -
>>> Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the
>>> query "id:Trevor Hastie"?
>>> 
>>> Thank you.
>>> 
>>> Best regards,
>>> Ziyuan
>>
