[
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649986#action_12649986
]
Grant Ingersoll commented on SOLR-284:
--------------------------------------
{quote}
I think I like where this is going.
{quote}
Great! I think the nice thing is as Tika grows, we'll get many more formats
all for free. For instance, I saw someone working on a Flash extractor.
{quote}
Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == false,
which means that if Tika returns a metadata field and you haven't made an
explicit mapping from the Tika fieldname to your Solr fieldname, then Solr will
throw an exception and your document add will fail. This doesn't sound very
robust for a production environment, unless Tika will only ever use a
finite list of metadata field names. (That doesn't sound plausible, though I
admit I haven't looked into it.) Even in that case, I think I'd rather not have
to set up a mapping for every possible field name in order to get started with
this handler. Would true perhaps be a better default?
{quote}
I guess I was thinking that most people will probably start out with this by
sending their docs through the engine and seeing what happens. I think an
exception helps them see sooner what they are missing. That being said, I
don't feel particularly strongly about it. It's easy enough to set it to true
in the request handler mappings. From what I see of Tika, though, the
possible values for metadata are fixed within a version. Perhaps the bigger
issue is what happens when someone updates Tika to a newer version with newer
Metadata options.
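To make the trade-off concrete, here is a minimal sketch of the mapping
behavior described above (illustrative Python only; the handler itself is
Java, and the function and argument names here are made up, with the flag
playing the role of ext.ignore.und.fl):

```python
# Illustrative sketch, not the patch's actual code. The ignore_undeclared
# flag mirrors ext.ignore.und.fl: when False, an unmapped Tika metadata
# field aborts the document add with an error.

def map_metadata(tika_fields, field_map, ignore_undeclared=False):
    """Map Tika metadata field names to Solr field names."""
    mapped = {}
    for name, value in tika_fields.items():
        if name in field_map:
            mapped[field_map[name]] = value
        elif ignore_undeclared:
            continue  # silently drop the unmapped field
        else:
            raise ValueError(f"no mapping for Tika field '{name}'")
    return mapped
```

With ignore_undeclared=False a single unmapped field fails the whole add,
which is the "see sooner what you are missing" behavior; with True the
field is just dropped.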
{quote}
ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this feature,
Grant? The example in the patch is of routing text from <div> tags to one Solr
field while routing text from other tags to a different Solr field. I'm kind of
curious when this would be useful, especially keeping in mind that, in general,
Tika source documents are not HTML, and so when <div> tags are generated
they're as much artifacts of Tika as reflecting anything in the underlying
document. (You could maybe ask a similar question about ext.inx.attr /
INDEX_ATTRIBUTES.)
{quote}
For capture fields, it's similar to a copy field function. Say, for example,
you want a whole document in one field, but also to be able to search within
paragraphs. Then, you could use a capture field on a <p> tag to do that.
Thus, you get the best of both worlds. (Keep in mind the Tika output is XHTML.)
Also, since extraction is happening on the server side, I want to make sure we
have lots of options for dealing with the content. I don't know where else one
would have the option to muck with the content post-extraction but pre-indexing.
Hooking into the processor chain is too late, since by then the Tika structure
is gone. That's my reasoning, anyway.
Similarly for index attributes: when Tika extracts from an HTML file and comes
across anchor tags (<a>), it provides the attributes of those tags as XML
attributes. So, one may want to extract the links separately from the main
content and put them into a separate field.
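A rough sketch of both ideas over Tika's XHTML output (illustrative Python;
the real handler consumes SAX events rather than a parsed DOM, and the
function and field names here are invented for the example):

```python
# Illustrative only: capture <p> text into its own field while still
# keeping the whole-document text, and pull <a> attributes (links) into
# a separate field.
import xml.etree.ElementTree as ET

def extract_fields(xhtml, capture_tags=("p",), attr_tags=("a",)):
    root = ET.fromstring(xhtml)
    # Whole document in one field...
    doc = {"content": "".join(root.itertext()).strip(), "links": []}
    for el in root.iter():
        # ...plus each captured tag's text in a per-tag field,
        if el.tag in capture_tags:
            doc.setdefault(el.tag, []).append("".join(el.itertext()))
        # ...and tag attributes (e.g. href) routed to a links field.
        if el.tag in attr_tags:
            doc["links"].extend(el.attrib.values())
    return doc
```

So a search within paragraphs goes against the "p" field, full-document
search against "content", and the links end up isolated from the main text.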
> Parsing Rich Document Types
> ---------------------------
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Eric Pugh
> Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch,
> SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip,
> test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, or Excel document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.