Hey Tim, sounds great to me. — Chris Mattmann [email protected]
On 6/14/16, 8:53 AM, "Allison, Timothy B." <[email protected]> wrote:

> Oh, wow. Y, that's probably more than we'd want to support (unless any other Tika devs have an interest?)... very, very cool!
>
> -----Original Message-----
> From: Justin Lee [mailto:[email protected]]
> Sent: Monday, June 13, 2016 5:05 PM
> To: [email protected]
> Subject: Re: Bypassing ExtractingRequestHandler
>
> Thanks everyone for the help and advice. The SolrJ example makes sense to me. The import of SOLR-8166 was kind of mind-boggling to me, but maybe I'll revisit it after some time.
>
> Tim: for context, I'm ultimately trying to create an external highlighter. See https://issues.apache.org/jira/browse/SOLR-1397. I want to store the bounding box (in PDF units) for each token in the extracted text stream. Then, when I get results from Solr using the above patch, I'll convert the UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in the UI. I like this approach because I get highlighting that accurately reflects the search, even when the search is complex (e.g. wildcards or proximity searches).
>
> I think it would take quite a bit of thinking to get something general enough to add into Tika. For example, what units? Take a look at the discussion of what units to report offsets in here: https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert Muir -- although whatever issues exist there, they are the same as for the offsets reported by the Term Vector Component, it would seem to me). As another example, I'm just not sure what format is general enough to make sense for everybody. I think I'll just create a mapping from UTF-16 offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL store. Then, when I get Solr results, I'll look at the matching offsets, the JSON blob, and the original document, and be on my merry way. I'm happy to open a JIRA entry in Tika if you think this is a coherent request.
>
> The other approach, I suppose, is to pass the information along during indexing and store it as a token payload. But it seems like the indexing interface is really text oriented. I have also thought about using DelimitedPayloadTokenFilter, which I imagine will increase the index size (how much, though?) and require more customization of Solr internals. I don't know which is the better approach.
>
> On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <[email protected]> wrote:
>
>> > Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should be straightforward: http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> +1
>>
>> > We tend to prefer running Tika externally as it's entirely possible that Tika will crash or hang with certain files - and that will bring down Solr if you're running Tika within it.
>>
>> +1
>>
>> >> I want to make a small modification to Tika to get and save additional data from my PDFs
>>
>> What info do you need, and if it is common enough, could you ask over on Tika's JIRA and we'll try to add it directly?
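The offset-to-coordinates mapping Justin describes is easy to sketch. The following is a minimal, hypothetical illustration (the class and field names are mine, not from the thread): keep, for each token in the extracted text, its UTF-16 start/end offsets together with a page number and a bounding box in PDF units, then look up the boxes that overlap a match span reported by Solr.

```java
// Hypothetical sketch of the mapping described in the thread: token spans in
// the extracted text (UTF-16 code units, the same units Solr reports) mapped
// to rectangles in PDF user-space units. Names are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightMap {

    /** One token's character span and its bounding box on the page. */
    public static class TokenBox {
        public final int startOffset;      // UTF-16 offset of the token's first char
        public final int endOffset;        // UTF-16 offset just past the last char
        public final int page;             // 1-based page number
        public final float x1, y1, x2, y2; // bounding box in PDF units

        public TokenBox(int startOffset, int endOffset, int page,
                        float x1, float y1, float x2, float y2) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.page = page;
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    private final List<TokenBox> boxes = new ArrayList<>();

    /** Called while walking the PDF text stream during extraction. */
    public void add(TokenBox box) {
        boxes.add(box);
    }

    /** Given a match span reported by Solr, return every box it overlaps. */
    public List<TokenBox> boxesFor(int matchStart, int matchEnd) {
        List<TokenBox> hits = new ArrayList<>();
        for (TokenBox b : boxes) {
            if (b.startOffset < matchEnd && b.endOffset > matchStart) {
                hits.add(b);
            }
        }
        return hits;
    }
}
```

Whether the box list lives in a JSON blob in a NoSQL store (as Justin proposes) or elsewhere is an implementation detail; the only requirement is that the lookup uses the same offset units Solr reports back.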

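For completeness, here is a minimal sketch of the pattern the thread recommends: run Tika in your own process and push the extracted text to Solr with SolrJ, rather than posting raw files to ExtractingRequestHandler. The collection URL and field names are assumptions, and it presumes SolrJ 6.x or later (for HttpSolrClient.Builder) and Tika's standard AutoDetectParser.

```java
// Sketch of client-side Tika extraction plus SolrJ indexing, assuming a Solr
// collection at http://localhost:8983/solr/docs with "id" and "content" fields.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // 1. Extract text outside Solr, so a bad file can't take Solr down.
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata);
        }

        // 2. Send the extracted text to Solr with SolrJ.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getFileName().toString());
            doc.addField("content", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}
```

Keeping the parse in a separate process (or at least outside Solr's JVM) gives the isolation the thread describes: a parse that hangs or crashes on a pathological file cannot take the search node down with it.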