Have you tried specifying &extractFormat=text in the request? With extractOnly=true, Solr returns the extracted content as XHTML by default.
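
As a rough sketch of what that request could look like, the snippet below builds the same extract URL with extractFormat=text added. The helper name build_extract_url is made up for illustration; the host, core name and file path are taken from the original request. Note that the separate metadata list may still be part of the response; only the format of the extracted content changes.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, file_path, extract_format="text"):
    """Build an extract-only request URL (hypothetical helper, not part of Solr)."""
    params = {
        "extractOnly": "true",            # extract without indexing
        "extractFormat": extract_format,  # "text" instead of the default XHTML
        "stream.file": file_path,         # server-side file, as in the original request
    }
    # urlencode percent-encodes the Windows path (colon and backslashes)
    return f"{base_url}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/document",
    r"C:\TIKA\FileTest\Test.txt",
)
print(url)
```

Pasting the printed URL into a browser, or fetching it with any HTTP client, should be enough to compare the result with the XHTML response quoted below.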

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Nov 2018, at 12:09, marotosg <[email protected]> wrote:
> 
> Hi all,
> 
> I am currently trying to index different kinds of documents with Solr
> and Tika. Indexing works fine, but when Solr returns the content of a
> document, it does not return just the plain text. It also comes back
> with some metadata.
> 
> For instance, this is my request:
> http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> 
> The content of the Test.txt file is just "*Test File*".
> 
> As you can see below, the response from Solr contains plenty of extra
> information. I would like the answer to be something like this, without
> the noise for search:
> <str name="Test.txt">
> Test File
> </str>
> 
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">135</int>
> </lst>
> <str name="Test.txt">
> <?xml version="1.0" encoding="UTF-8"?> <html
> xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="stream_size"
> content="13"/> <meta name="X-Parsed-By"
> content="org.apache.tika.parser.DefaultParser"/> <meta name="X-Parsed-By"
> content="org.apache.tika.parser.txt.TXTParser"/> <meta name="stream_name"
> content="Test.txt"/> <meta name="stream_source_info"
> content="file:/C:/TIKA/FileTest/Test.txt"/> <meta name="Content-Encoding"
> content="ISO-8859-1"/> <meta name="Content-Type" content="text/plain;
> charset=ISO-8859-1"/> <title></title> </head> <body> <p>Test File</p>
> </body> </html>
> </str>
> <lst name="Test.txt_metadata">
> <arr name="stream_size">
> <str>13</str>
> </arr>
> <arr name="X-Parsed-By">
> <str>org.apache.tika.parser.DefaultParser</str>
> <str>org.apache.tika.parser.txt.TXTParser</str>
> </arr>
> <arr name="stream_name">
> <str>Test.txt</str>
> </arr>
> <arr name="stream_source_info">
> <str>file:/C:/TIKA/FileTest/Test.txt</str>
> </arr>
> <arr name="Content-Encoding">
> <str>ISO-8859-1</str>
> </arr>
> <arr name="Content-Type">
> <str>text/plain; charset=ISO-8859-1</str>
> </arr>
> </lst>
> </response>
> 
> Can anyone shed some light on this?
> Thanks a lot.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
