Hey Tim, sounds great to me. — Chris Mattmann [email protected]
On 6/14/16, 8:53 AM, "Allison, Timothy B." <[email protected]> wrote:

> Oh, wow. Y, that's probably more than we'd want to support (unless any other Tika devs have an interest?)... very, very cool!
>
> -----Original Message-----
> From: Justin Lee [mailto:[email protected]]
> Sent: Monday, June 13, 2016 5:05 PM
> To: [email protected]
> Subject: Re: Bypassing ExtractingRequestHandler
>
> Thanks everyone for the help and advice. The SolrJ example makes sense to me. The import of SOLR-8166 was kind of mind-boggling to me, but maybe I'll revisit it after some time.
>
> Tim: for context, I'm ultimately trying to create an external highlighter. See https://issues.apache.org/jira/browse/SOLR-1397. I want to store the bounding box (in PDF units) for each token in the extracted text stream. Then, when I get results from Solr using the above patch, I'll convert the UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in the UI. I like this approach because I get highlighting that accurately reflects the search, even when the search is complex (e.g. wildcards or proximity searches).
>
> I think it would take quite a bit of thinking to get something general enough to add into Tika. For example, what units? Take a look at the discussion of what units to report offsets in here: https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert Muir -- although whatever issues exist there, they are the same as for the offsets reported by the Term Vector Component, it would seem to me). As another example, I'm just not sure what format is general enough to make sense for everybody. I think I'll just create a mapping from UTF-16 offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL store. Then, when I get Solr results, I'll look at the matching offsets, the JSON blob, and the original document, and be on my merry way. I'm happy to open a JIRA entry in Tika if you think this is a coherent request.
>
> The other approach, I suppose, is to pass the information along during indexing and store it as a token payload. But it seems like the indexing interface is really text oriented. I have also thought about using DelimitedPayloadTokenFilter, which I imagine will increase the index size (how much, though?) and require more customization of Solr internals. I don't know which is the better approach.
>
> On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <[email protected]> wrote:
>
>> > Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should be straightforward: http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> +1
>>
>> > We tend to prefer running Tika externally as it's entirely possible that Tika will crash or hang with certain files - and that will bring down Solr if you're running Tika within it.
>>
>> +1
>>
>> >> I want to make a small modification to Tika to get and save additional data from my PDFs
>>
>> What info do you need, and if it is common enough, could you ask over on Tika's JIRA and we'll try to add it directly?
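The offset-to-coordinates mapping Justin describes is easy to sketch. The following is a minimal, hypothetical illustration (the class and field names are mine, not from the thread): keep, for each token in the extracted text, its UTF-16 start/end offsets together with a page number and a bounding box in PDF units, then look up the boxes that overlap a match span reported by Solr.

```java
// Hypothetical sketch of the mapping described in the thread: token spans in
// the extracted text (UTF-16 code units, the same units Solr reports) mapped
// to rectangles in PDF user-space units. Names are illustrative only.
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightMap {

    /** One token's character span and its bounding box on the page. */
    public static class TokenBox {
        public final int startOffset;      // UTF-16 offset of the token's first char
        public final int endOffset;        // UTF-16 offset just past the last char
        public final int page;             // 1-based page number
        public final float x1, y1, x2, y2; // bounding box in PDF units

        public TokenBox(int startOffset, int endOffset, int page,
                        float x1, float y1, float x2, float y2) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.page = page;
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    private final List<TokenBox> boxes = new ArrayList<>();

    /** Called while walking the PDF text stream during extraction. */
    public void add(TokenBox box) {
        boxes.add(box);
    }

    /** Given a match span reported by Solr, return every box it overlaps. */
    public List<TokenBox> boxesFor(int matchStart, int matchEnd) {
        List<TokenBox> hits = new ArrayList<>();
        for (TokenBox b : boxes) {
            if (b.startOffset < matchEnd && b.endOffset > matchStart) {
                hits.add(b);
            }
        }
        return hits;
    }
}
```

Whether the box list lives in a JSON blob in a NoSQL store (as Justin proposes) or elsewhere is an implementation detail; the only requirement is that the lookup uses the same offset units Solr reports back.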

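For completeness, here is a minimal sketch of the pattern the thread recommends: run Tika in your own process and push the extracted text to Solr with SolrJ, rather than posting raw files to ExtractingRequestHandler. The collection URL and field names are assumptions, and it presumes SolrJ 6.x or later (for HttpSolrClient.Builder) and Tika's standard AutoDetectParser.

```java
// Sketch of client-side Tika extraction plus SolrJ indexing, assuming a Solr
// collection at http://localhost:8983/solr/docs with "id" and "content" fields.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // 1. Extract text outside Solr, so a bad file can't take Solr down.
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata);
        }

        // 2. Send the extracted text to Solr with SolrJ.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getFileName().toString());
            doc.addField("content", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}
```

Keeping the parse in a separate process (or at least outside Solr's JVM) gives the isolation the thread describes: a parse that hangs or crashes on a pathological file cannot take the search node down with it.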