Dear Erick and Timothy,

Yes, I will do the parsing on the client, given all its benefits. For now I am just trying to understand what is going on by indexing one or two PDF files first. Thank you both.
Best regards,
Ziyuan

On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Hope that there is no side effect of not mapping the PDF
>
> Well, yes it will have that side effect. You can cure that with a
> copyField directive from content to _text_.
>
> But do really consider running this as a SolrJ program on the client.
> Tim knows in far more painful detail than I do what kinds of problems
> there are when parsing all the different formats so I'd _really_
> follow his advice.
>
> Tika pretty much has an impossible job. "Here, try to parse all these
> different formats, implemented by different vendors with different
> versions that more or less follow a spec which really isn't a spec in
> many cases just recommendations using packages that may or may not be
> actively maintained. And by the way, we'll try to handle that 1G
> document that someone sends us, but don't blame us if we hit an
> OOM.....". When Tika is run on the same box as Solr any problems in
> that entire chain can adversely affect your search.
>
> Not to mention that Tika has to do some heavy lifting, using CPU
> cycles that are then unavailable to Solr.
>
> Extracting Request Handler is a fine way to get started, but for
> production seriously consider a separate client.
>
> Best,
> Erick
>
> On Mon, Jun 19, 2017 at 6:24 AM, ZiYuan <ziyu...@gmail.com> wrote:
> > Hi Erick,
> >
> > Now it is clear. I have to update the request handler of /update/extract/
> > from
> > "defaults":{"fmap.content":"_text_"}
> > to
> > "defaults":{"fmap.content":"content"}
> > to fill the field.
> >
> > Hope that there is no side effect of not mapping the PDF content to _text_.
> > Thank you for the hint.
> >
> > Best regards,
> > Ziyuan
> >
> > On Mon, Jun 19, 2017 at 1:55 PM, Erik Hatcher <erik.hatc...@gmail.com>
> > wrote:
> >
> >> Ziyuan -
> >>
> >> You may be interested in the example/files that ships with Solr too. It’s
> >> got schema and config and even UI for file indexing and searching. Check
> >> out README.txt under example/files in your Solr install.
> >>
> >> Erik
> >>
> >> > On Jun 19, 2017, at 6:52 AM, ZiYuan <ziyu...@gmail.com> wrote:
> >> >
> >> > Hi Erick,
> >> >
> >> > thanks very much for the explanations! Clarification for question 2: more
> >> > specifically, I cannot see the field content in the returned JSON, with
> >> > the same definitions as in the post
> >> > <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >> > :
> >> >
> >> > <field name="content" type="text_general" indexed="false" stored="true"/>
> >> > <field name="text" type="text_general" multiValued="true" indexed="true"
> >> > stored="false"/>
> >> > <copyField source="content" dest="text"/>
> >> >
> >> > Is it so that Tika does not fill these two fields automatically and I
> >> > have to write some client code to fill them?
> >> >
> >> > Best regards,
> >> > Ziyuan
> >> >
> >> >
> >> > On Sun, Jun 18, 2017 at 8:07 PM, Erick Erickson <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> 1> Yes, you can use your single definition. The author identifies the
> >> >> "text" field as a catch-all. Somewhere in the schema there'll be a
> >> >> copyField directive copying (perhaps) many different fields to the
> >> >> "text" field. That permits simple searches against a single field
> >> >> rather than, say, using edismax to search across multiple separate
> >> >> fields.
> >> >>
> >> >> 2> The link you referenced is for Data Import Handler, which is much
> >> >> different than just posting files to Solr. See
> >> >> ExtractingRequestHandler:
> >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika.
> >> >> There are ways to map meta-data fields from the doc into specific
> >> >> fields matching your schema. Be a little careful here. There is no
> >> >> standard across different types of docs as to what meta-data field is
> >> >> included. PDF might have a "last_edited" field. Word might have a
> >> >> "last_modified" field where the two mean the same thing. Here's a link
> >> >> to a SolrJ program that'll dump all the fields:
> >> >> https://lucidworks.com/2012/02/14/indexing-with-solrj/. You can easily
> >> >> hack out the DB bits.
> >> >>
> >> >> BTW, once you get more familiar with processing, I strongly recommend
> >> >> you do the document processing on the client, the reasons are outlined
> >> >> in that article.
> >> >>
> >> >> bq: even if I define the fields as he said I cannot see them in the
> >> >> search results as keys in JSON
> >> >> are the fields set as stored="true"? They must be to be returned in
> >> >> requests (skipping the docValues discussion here).
> >> >>
> >> >> 3> Yes, the text field is a concatenation of all the other ones.
> >> >> Because it has stored=false, you can only search it, you cannot
> >> >> highlight or view. Fields you highlight must have stored=true BTW.
> >> >>
> >> >> Whether or not you can highlight "Trevor Hastie" depends on a lot of
> >> >> things, most particularly whether that text is ever actually in a
> >> >> field in your index. There's no guarantee that the name of the file
> >> >> is indexed in a searchable/highlightable way.
> >> >>
> >> >> And the query q=id:Trevor Hastie won't do what you think. It'll be
> >> >> parsed as
> >> >> id:Trevor _text_:Hastie
> >> >> _text_ is the default field, look for a "df" parameter in your request
> >> >> handler in solrconfig.xml (usually "/select" or "/query").
> >> >>
> >> >> On Sat, Jun 17, 2017 at 3:04 PM, ZiYuan <ziyu...@gmail.com> wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I am new to Solr and I need to implement a full-text search of some PDF
> >> >>> files. The indexing part works out of the box by using bin/post. I can
> >> >>> see search results in the admin UI given some queries, though without
> >> >>> the matched texts and the context.
> >> >>>
> >> >>> Now I am reading this post
> >> >>> <http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/>
> >> >>> for the highlighting part. It is for an older version of Solr when
> >> >>> managed schema was not available. Before I fully understand what it is
> >> >>> doing, I have some questions:
> >> >>>
> >> >>> 1. He defined two fields:
> >> >>>
> >> >>> <field name="content" type="text_general" indexed="false" stored="true"
> >> >>> multiValued="false"/>
> >> >>> <field name="text" type="text_general" indexed="true" stored="false"
> >> >>> multiValued="true"/>
> >> >>>
> >> >>> But why are two fields needed? Can I define a field
> >> >>>
> >> >>> <field name="content" type="text_general" indexed="true" stored="true"
> >> >>> multiValued="true"/>
> >> >>>
> >> >>> to capture the full text?
> >> >>>
> >> >>> 2. How are the fields filled? I don't see relevant information in
> >> >>> TikaEntityProcessor's documentation
> >> >>> <https://lucene.apache.org/solr/6_6_0/solr-dataimporthandler-extras/org/apache/solr/handler/dataimport/TikaEntityProcessor.html#fields.inherited.from.class.org.apache.solr.handler.dataimport.EntityProcessorBase>.
> >> >>> The current text extractor should already be Tika (I can see
> >> >>>
> >> >>> "x_parsed_by":
> >> >>> ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
> >> >>>
> >> >>> in the returned JSON of some query). But even if I define the fields as
> >> >>> he said I cannot see them in the search results as keys in JSON.
> >> >>>
> >> >>> 3. The _text_ field seems to be a concatenation of other fields, does
> >> >>> it contain the full text? Though it does not seem to be accessible by
> >> >>> default.
> >> >>>
> >> >>> To be brief, using The Elements of Statistical Learning
> >> >>> <http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf>
> >> >>> as an example, how do I highlight the relevant text for the query
> >> >>> "SVM"? And if I change the file name to "The Elements of Statistical
> >> >>> Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor
> >> >>> Hastie" for the query "id:Trevor Hastie"?
> >> >>>
> >> >>> Thank you.
> >> >>>
> >> >>> Best regards,
> >> >>> Ziyuan
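
For reference, a minimal sketch of the kind of highlighting request discussed in the thread above. It only builds the request URL and does not contact Solr. The base URL (a default local install), the core name "pdfs", and the helper name are assumptions made for illustration; "content" is assumed to be the stored field that the extract handler fills. The parameters q, hl, hl.fl and wt are standard Solr request parameters. The phrase is quoted so that all of it is searched in the named field rather than being split between that field and the default field.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, field, phrase, hl_field):
    """Build a /select URL that searches `field` for an exact phrase and
    asks Solr to highlight matches in `hl_field`.

    `hl_field` must be stored="true" in the schema, otherwise Solr has
    nothing to build highlight snippets from.
    """
    # Escape backslashes and double quotes, then wrap the phrase in quotes
    # so every term is searched in `field` (unquoted, only the first term
    # would be; the rest would go to the default field).
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'{field}:"{escaped}"',
        "hl": "true",       # turn highlighting on
        "hl.fl": hl_field,  # field(s) to highlight
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core name "pdfs" on a default local Solr install.
url = highlight_query_url(
    "http://localhost:8983/solr/pdfs", "content", "Trevor Hastie", "content"
)
print(url)
```

Whether such a request returns any highlights still depends on the points made in the thread: the text has to actually be present in the queried field, and that field has to be stored.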