Have you tried specifying &extractFormat=text?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

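For reference, a minimal sketch of how the request from the quoted message might look with that parameter added. It assumes the same host, core name ("document") and file path as the original request; the helper names `build_extract_url` and `fetch_extract` are made up for illustration. As far as I know, extractFormat only takes effect together with extractOnly=true, and the separate `Test.txt_metadata` list will most likely still be part of the response.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import urlopen

# Assumption: same host and core ("document") as in the original request.
EXTRACT_HANDLER = "http://localhost:8983/solr/document/update/extract"


def build_extract_url(file_path, handler=EXTRACT_HANDLER):
    """Build an extract-only URL that asks for plain-text serialization."""
    params = {
        "extractOnly": "true",     # return the extraction instead of indexing it
        "extractFormat": "text",   # plain text instead of the default XHTML
        "stream.file": file_path,  # server-side path, as in the original request
    }
    # urlencode percent-encodes the backslashes and colon in the Windows path.
    return handler + "?" + urlencode(params)


def fetch_extract(file_path):
    """Send the request and return the raw response body (needs a running Solr)."""
    with urlopen(build_extract_url(file_path)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_extract_url(r"C:\TIKA\FileTest\Test.txt"))
```

Building the query string with `urlencode` rather than by hand avoids problems with the unescaped backslashes in the Windows path.
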
> On 14 Nov 2018, at 12:09, marotosg <[email protected]> wrote:
>
> Hi all,
>
> Currently I am trying to index documents of different kinds with Solr
> and Tika. It's working fine, but when Solr returns the content of the
> document, it doesn't return just the plain text. It also comes back with
> some metadata.
>
> For instance, my request:
> http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
>
> The content of the Test.txt file is just "*Test File*".
>
> The response from Solr, as you can see below, returns plenty of information.
> I would like the answer to be something like this, without noise for the search:
> <str name="Test.txt">
> Test File
> </str>
>
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">135</int>
> </lst>
> <str name="Test.txt">
> <?xml version="1.0" encoding="UTF-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="stream_size" content="13"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser"/>
> <meta name="stream_name" content="Test.txt"/>
> <meta name="stream_source_info" content="file:/C:/TIKA/FileTest/Test.txt"/>
> <meta name="Content-Encoding" content="ISO-8859-1"/>
> <meta name="Content-Type" content="text/plain; charset=ISO-8859-1"/>
> <title></title>
> </head>
> <body>
> <p>Test File</p>
> </body>
> </html>
> </str>
> <lst name="Test.txt_metadata">
> <arr name="stream_size">
> <str>13</str>
> </arr>
> <arr name="X-Parsed-By">
> <str>org.apache.tika.parser.DefaultParser</str>
> <str>org.apache.tika.parser.txt.TXTParser</str>
> </arr>
> <arr name="stream_name">
> <str>Test.txt</str>
> </arr>
> <arr name="stream_source_info">
> <str>file:/C:/TIKA/FileTest/Test.txt</str>
> </arr>
> <arr name="Content-Encoding">
> <str>ISO-8859-1</str>
> </arr>
> <arr name="Content-Type">
> <str>text/plain; charset=ISO-8859-1</str>
> </arr>
> </lst>
> </response>
>
> Can anyone shed some light here?
> Thanks a lot.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
