[jira] Commented: (SOLR-284) Parsing Rich Document Types

Chris Harris (JIRA) Tue, 25 Mar 2008 14:35:32 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582062#action_12582062
 ]


Chris Harris commented on SOLR-284:
-----------------------------------

I'm thinking it would be handy if RichDocumentRequestHandler could support 
indexing text and HTML files, in addition to the fancier formats (pdf, doc, 
etc.). That way I could use RichDocumentRequestHandler for all my indexing 
needs (except commits and optimizes), rather than using it for some doc types 
but still having to use XmlUpdateRequestHandler for text and HTML docs. Would 
anyone else find this useful?

I skimmed the source, and adding support for text files looks trivial. (It's 
just a pass-through.) And if you had this, then I guess you'd have at least one 
version of HTML support for free; in particular, you could upload your HTML 
file to RichDocumentRequestHandler, telling the handler that the document is in 
plain text format, and then strip off the HTML tags later by using the 
HTMLStripStandardTokenizer in your schema.xml.
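
To make the "pass-through" idea concrete, here is a minimal sketch, not taken from the attached patch. The class and method names (PlainTextPassThrough, extractText) are hypothetical; the only point is that a plain-text "parser" would hand the uploaded characters to the index unchanged, so any HTML tags survive until analysis time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; it is not part of rich.patch.
class PlainTextPassThrough {
    // Reads the whole stream and returns it unchanged: no parsing, no stripping.
    static String extractText(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Wrapped so that callers need not declare a checked exception.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With something like this, an HTML upload declared as plain text would arrive in the field with its markup intact, and the tag stripping would be left entirely to the analyzer configured in schema.xml.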

Alternatively, RichDocumentRequestHandler could provide its own explicit HTML 
to text conversion. There would probably be some advantages to this, but I'm 
not sure exactly what they would be. One, I guess, would be that you could use 
tokenizers that didn't make use of HTMLStripReader.
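
As a rough illustration of what an explicit conversion step might look like, here is a deliberately naive sketch. The names (NaiveHtmlToText, toText) are made up for this example, and the regex approach is an assumption of mine rather than anything in the patch; it does not decode entities and does not treat script or style blocks or comments specially, so a real implementation would want a proper HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical, naive HTML-to-text step; for illustration only.
class NaiveHtmlToText {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toText(String html) {
        // Replace each tag with a space so adjacent words do not run together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }
}
```

Because the text would already be tag-free when it reaches the field, the schema could then use any tokenizer, not only the HTML-stripping ones.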

> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Eric Pugh
>             Fix For: 1.3
>
>         Attachments: libs.zip, rich.patch, source.zip, test-files.zip, 
> test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
