Ziyuan - You may also be interested in example/files, which ships with Solr. It has a schema, a config, and even a UI for file indexing and searching. Check out README.txt under example/files in your Solr install.
Erik

> On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
>
> Hi Erick,
>
> thanks very much for the explanations! Clarification for question 2: more
> specifically I cannot see the field content in the returned JSON, with the
> same definitions as in the post
> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>:
>
> <field name="content" type="text_general" indexed="false" stored="true"/>
> <field name="text" type="text_general" multiValued="true" indexed="true"
> stored="false"/>
> <copyField source="content" dest="text"/>
>
> Is it so that Tika does not fill these two fields automatically and I have
> to write some client code to fill them?
>
> Best regards,
> Ziyuan
>
>
> On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> 1> Yes, you can use your single definition. The author identifies the
>> "text" field as a catch-all. Somewhere in the schema there'll be a
>> copyField directive copying (perhaps) many different fields to the
>> "text" field. That permits simple searches against a single field
>> rather than, say, using edismax to search across multiple separate
>> fields.
>>
>> 2> The link you referenced is for Data Import Handler, which is much
>> different than just posting files to Solr. See
>> ExtractingRequestHandler:
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
>> There are ways to map meta-data fields from the doc into specific
>> fields matching your schema. Be a little careful here. There is no
>> standard across different types of docs as to what meta-data field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing. Here's a link
>> to a SolrJ program that'll dump all the fields:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
>> hack out the DB bits.
>>
>> BTW, once you get more familiar with processing, I strongly recommend
>> you do the document processing on the client, the reasons are outlined
>> in that article.
>>
>> bq: even if I define the fields as he said I cannot see them in the
>> search results as keys in JSON
>> are the fields set as stored="true"? They must be to be returned in
>> requests (skipping the docValues discussion here).
>>
>> 3> Yes, the text field is a concatenation of all the other ones.
>> Because it has stored=false, you can only search it, you cannot
>> highlight or view. Fields you highlight must have stored=true BTW.
>>
>> Whether or not you can highlight "Trevor Hastie" depends on a lot of
>> things, most particularly whether that text is ever actually in a
>> field in your index. Just because there's no guarantee that the name
>> of the file is indexed in a searchable/highlightable way.
>>
>> And the query q=id:Trevor Hastie won't do what you think. It'll be parsed
>> as
>> id:Trevor _text_:Hastie
>> _text_ is the default field, look for a "df" parameter in your request
>> handler in solrconfig.xml (usually "/select" or "/query").
>>
>> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
>>> Hi,
>>>
>>> I am new to Solr and I need to implement a full-text search of some PDF
>>> files. The indexing part works out of the box by using bin/post. I can see
>>> search results in the admin UI given some queries, though without the
>>> matched texts and the context.
>>>
>>> Now I am reading this post
>>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
>>> for the highlighting part. It is for an older version of Solr when managed
>>> schema was not available. Before fully understanding what it is doing, I
>>> have some questions:
>>>
>>> 1. He defined two fields:
>>>
>>> <field name="content" type="text_general" indexed="false" stored="true"
>>> multiValued="false"/>
>>> <field name="text" type="text_general" indexed="true" stored="false"
>>> multiValued="true"/>
>>>
>>> But why are there two fields needed? Can I define a field
>>>
>>> <field name="content" type="text_general" indexed="true" stored="true"
>>> multiValued="true"/>
>>>
>>> to capture the full text?
>>>
>>> 2. How are the fields filled? I don't see relevant information in
>>> TikaEntityProcessor's documentation
>>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
>>> The current text extractor should already be Tika (I can see
>>>
>>> "x_parsed_by":
>>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
>>>
>>> in the returned JSON of some query). But even if I define the fields as he
>>> said I cannot see them in the search results as keys in JSON.
>>>
>>> 3. The _text_ field seems a concatenation of other fields, does it contain
>>> the full text? Though it does not seem to be accessible by default.
>>>
>>> To be brief, using The Elements of Statistical Learning
>>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
>>> as an example, how to highlight the relevant texts for the query "SVM"? And
>>> if changing the file name into "The Elements of Statistical Learning -
>>> Trevor Hastie.pdf" and post it, how to highlight "Trevor Hastie" for the
>>> query "id:Trevor Hastie"?
>>>
>>> Thank you.
>>>
>>> Best regards,
>>> Ziyuan
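
For reference, here is a minimal sketch of how the highlighting request discussed in this thread could be put together. It only builds the request URL and does not contact Solr. The host, port, and core name ("mycore") are assumptions that are not from the thread. The field names follow the schema quoted above: "text" is the indexed catch-all that is searched, and "content" is the stored field that is highlighted. The parameters q, df, hl, hl.fl, and wt are standard Solr request parameters. The helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumptions, not from the thread: a local Solr on the default port
# and a hypothetical core called "mycore".
SOLR_BASE = "http://localhost:8983/solr"
CORE = "mycore"


def phrase_query(field, text):
    """Quote the text so every word stays in `field`.

    Without the quotes, id:Trevor Hastie is parsed as
    id:Trevor <default field>:Hastie, as explained in the thread.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def highlight_url(query, search_field="text", stored_field="content"):
    """Build a /select URL that searches one field and highlights another.

    search_field must be indexed; stored_field must have stored="true",
    because only stored fields can be highlighted or returned.
    """
    params = {
        "q": query,
        "df": search_field,    # default field for unqualified terms
        "hl": "true",          # turn highlighting on
        "hl.fl": stored_field, # field(s) to produce snippets from
        "wt": "json",
    }
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(highlight_url("SVM"))
    print(highlight_url(phrase_query("id", "Trevor Hastie")))
```

Whether the second query matches anything still depends on how the id field is indexed in your schema, as noted above; the quoting only keeps both words in the same field.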