[ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726122#action_12726122 ]
Yonik Seeley commented on SOLR-284:
-----------------------------------

>> I just tried setting ext.idx.attr=false, and I didn't see any change after
>> indexing a PDF.
> This is often needed for HTML, where it is used to index the attributes of
> tags. Same would go for XML.

That's confusing given that the examples on the wiki show PDFs being indexed with ext.idx.attr=true.

It also confused me since the docs say "Index the Tika XHTML attributes into separate fields, named after the attribute." and the docs also say "Tika does everything by producing an XHTML stream that it feeds to a SAX ContentHandler". That led me to believe that ext.idx.attr was for all tika generated metadata (or maybe it is, but tika doesn't generally use attributes?)

It's also rather confusing just what rules can be applied to what. For example, does ext.metadata.prefix work on stuff produced by ext.idx.attr?

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch,
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch,
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip,
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into
> Solr.
> There is a wiki page with information here:
> http://wiki.apache.org/solr/UpdateRichDocuments
>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.