Provide multiple output formats in extract-only mode for tika handler
---------------------------------------------------------------------

                 Key: SOLR-1274
                 URL: https://issues.apache.org/jira/browse/SOLR-1274
             Project: Solr
          Issue Type: New Feature
    Affects Versions: 1.4
            Reporter: Peter Wolanin
            Priority: Minor
             Fix For: 1.4


The proposed feature is a URL parameter that selects the output format when 
using extract-only mode.  The parameter could simply overload the existing 
"ext.extract.only", so that a format can optionally be specified, e.g. 
false|true|xml|text, where true and xml give the same response (i.e. xml 
remains the default).
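
To make the proposed value handling concrete, here is a minimal sketch of how 
the overloaded parameter could be mapped to an output mode.  The class name, 
the enum and its constants are hypothetical and are not taken from the Solr 
source; treating a missing or empty value as "false" is also an assumption.

```java
import java.util.Locale;

public class ExtractOnlyFormat {
    /** Hypothetical output modes; the names are illustrative, not Solr's. */
    public enum Mode { DISABLED, XML, TEXT }

    /**
     * Maps the raw parameter value to a mode.
     * Assumption: a missing or empty value behaves like "false".
     */
    public static Mode parse(String raw) {
        if (raw == null) {
            return Mode.DISABLED;
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        switch (value) {
            case "":
            case "false":
                return Mode.DISABLED;
            case "true":   // "true" keeps today's behaviour, so xml stays the default
            case "xml":
                return Mode.XML;
            case "text":
                return Mode.TEXT;
            default:
                throw new IllegalArgumentException(
                        "Unsupported extract-only format: " + raw);
        }
    }

    public static void main(String[] args) {
        for (String v : new String[] {"false", "true", "xml", "text"}) {
            System.out.println(v + " -> " + parse(v));
        }
    }
}
```

Rejecting unknown values outright is one possible choice; falling back to xml 
would be another.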

I had been assuming that I could choose among possible tika output
formats when using the extracting request handler in extract-only mode
as if from the CLI with the tika jar:

   -x or --xml        Output XHTML content (default)
   -h or --html       Output HTML content
   -t or --text       Output plain text content
   -m or --metadata   Output only metadata

However, looking at the docs and source, it seems that only the xml
option is available; it is hard-coded in ExtractingDocumentLoader.java:
{code}
serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
{code}

Providing at least a plain-text response seems to work if you change the 
serializer to a TextSerializer (org.apache.xml.serialize.TextSerializer).
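
As a rough illustration of what a text-only serializer amounts to, the sketch 
below uses a plain SAX handler from the JDK that keeps character data and 
drops the markup.  It is a stand-in for the idea, not the Xerces 
TextSerializer and not the Solr code; the class and method names are 
hypothetical, and the XHTML input is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextOnlyHandlerSketch {
    /** Collects character data and ignores all element markup. */
    public static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public String getText() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns only its text content. */
    public static String extractText(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            // Wrapped so that callers need not declare checked exceptions.
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.getText();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

Note that this handler inserts no whitespace between elements, so adjacent 
paragraphs run together; a real text output would probably want separators.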





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.