Hi all,

I am currently trying to index different kinds of documents with Solr
and Tika. Indexing works fine, but when Solr returns the content of a
document, it does not return just the plain text. It also comes back with
some metadata.

For instance, this is my request:
http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt

The content of the Test.txt file is just "Test File".

As you can see below, the response from Solr contains plenty of extra information.
I would like the answer to be something like this, without the noise that gets in the way of search:
<str name="Test.txt">
Test File
</str>

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">135</int>
</lst>
<str name="Test.txt">
<?xml version="1.0" encoding="UTF-8"?> <html
xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="stream_size"
content="13"/> <meta name="X-Parsed-By"
content="org.apache.tika.parser.DefaultParser"/> <meta name="X-Parsed-By"
content="org.apache.tika.parser.txt.TXTParser"/> <meta name="stream_name"
content="Test.txt"/> <meta name="stream_source_info"
content="file:/C:/TIKA/FileTest/Test.txt"/> <meta name="Content-Encoding"
content="ISO-8859-1"/> <meta name="Content-Type" content="text/plain;
charset=ISO-8859-1"/> <title></title> </head> <body> <p>Test File</p>
</body> </html>
</str>
<lst name="Test.txt_metadata">
<arr name="stream_size">
<str>13</str>
</arr>
<arr name="X-Parsed-By">
<str>org.apache.tika.parser.DefaultParser</str>
<str>org.apache.tika.parser.txt.TXTParser</str>
</arr>
<arr name="stream_name">
<str>Test.txt</str>
</arr>
<arr name="stream_source_info">
<str>file:/C:/TIKA/FileTest/Test.txt</str>
</arr>
<arr name="Content-Encoding">
<str>ISO-8859-1</str>
</arr>
<arr name="Content-Type">
<str>text/plain; charset=ISO-8859-1</str>
</arr>
</lst>
</response>
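
To illustrate the result I am after, here is a minimal Python sketch of the client-side clean-up that would otherwise be needed. It assumes the XHTML that Tika produced has already been pulled out of the Solr response as a string; the function name body_text and the sample string are made up for illustration and are not part of Solr or Tika.

```python
import xml.etree.ElementTree as ET

# Namespace that Tika's XHTML output declares on the <html> element.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_text(xhtml: str) -> str:
    """Return the whitespace-normalised text found inside <body>."""
    # Parse as bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    # itertext() walks every nested text node; split/join collapses whitespace.
    return " ".join(" ".join(body.itertext()).split())


# Hypothetical sample, shortened from the response above.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head>'
    "<body><p>Test File</p></body></html>"
)
print(body_text(sample))
```

This only strips the markup after the fact; what I am really looking for is a way to make Solr return the plain text directly.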

Can anyone shed some light on this?
Thanks a lot.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
