Thanks Erick. I do use this strategy for indexing data from the DB. It is very flexible for me. I work in a company where .NET is the main dev platform, so it is even more important to separate things.
Does your post mean that the functionality for indexing documents in Solr using ExtractingRequestHandler doesn't provide the option of indexing plain data?

On Wed, 14 Nov 2018 at 16:14, Erick Erickson <erickerick...@gmail.com> wrote:

> While ERH is fine for getting started, as you go toward production
> you'll want to consider parsing the data outside of Solr for the
> reasons (and example) outlined here:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Wed, Nov 14, 2018 at 6:46 AM Sergio García Maroto <marot...@gmail.com> wrote:
> >
> > Thanks a lot Jan.
> > That works very well.
> >
> > I am now trying to index the doc in Solr by removing the extractOnly
> > parameter, and I can't find any similar option to get the data indexed
> > in plain text. I am getting the metadata as well.
> > This is my request:
> >
> > http://localhost:8983/solr/document/update/extract?iteral.id=DDOC001&stream.file=C:\TIKA\FileTest\Test.txt&commit=true&fmap.content=DocContentS
> >
> > My DocContentS contains
> > \n \n stream_size 13 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
> > X-Parsed-By org.apache.tika.parser.txt.TXTParser \n stream_name Test.txt
> > \n stream_source_info file:/C:/TIKA/FileTest/Test.txt \n Content-Encoding
> > ISO-8859-1 \n Content-Type text/plain; charset=ISO-8859-1 \n \n \n Prueba
> > Sergio \n "
> >
> > I can't find anywhere how to modify this behaviour.
> >
> > On Wed, 14 Nov 2018 at 13:06, Jan Høydahl <jan....@cominvent.com> wrote:
> >
> > > Have you tried to specify &extractFormat=text
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > >
> > > > On 14 Nov 2018 at 12:09, marotosg <marot...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Currently I am trying to index documents of different kinds with
> > > > Solr and Tika. It's working fine, but when Solr returns the content
> > > > of the document, it doesn't return only the plain text. It comes
> > > > back with some metadata as well.
> > > >
> > > > For instance, my request:
> > > >
> > > > http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> > > >
> > > > The content of the Test.txt file is just "*Test File*".
> > > >
> > > > The response from Solr, as you can see below, returns plenty of
> > > > information. I would like the answer to be something like this,
> > > > without noise for the search:
> > > >
> > > > <str name="Test.txt">
> > > > Test File
> > > > </str>
> > > >
> > > > <response>
> > > >   <lst name="responseHeader">
> > > >     <int name="status">0</int>
> > > >     <int name="QTime">135</int>
> > > >   </lst>
> > > >   <str name="Test.txt">
> > > >     <?xml version="1.0" encoding="UTF-8"?>
> > > >     <html xmlns="http://www.w3.org/1999/xhtml">
> > > >     <head>
> > > >     <meta name="stream_size" content="13"/>
> > > >     <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> > > >     <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser"/>
> > > >     <meta name="stream_name" content="Test.txt"/>
> > > >     <meta name="stream_source_info" content="file:/C:/TIKA/FileTest/Test.txt"/>
> > > >     <meta name="Content-Encoding" content="ISO-8859-1"/>
> > > >     <meta name="Content-Type" content="text/plain; charset=ISO-8859-1"/>
> > > >     <title></title>
> > > >     </head>
> > > >     <body>
> > > >     <p>Test File</p>
> > > >     </body>
> > > >     </html>
> > > >   </str>
> > > >   <lst name="Test.txt_metadata">
> > > >     <arr name="stream_size">
> > > >       <str>13</str>
> > > >     </arr>
> > > >     <arr name="X-Parsed-By">
> > > >       <str>org.apache.tika.parser.DefaultParser</str>
> > > >       <str>org.apache.tika.parser.txt.TXTParser</str>
> > > >     </arr>
> > > >     <arr name="stream_name">
> > > >       <str>Test.txt</str>
> > > >     </arr>
> > > >     <arr name="stream_source_info">
> > > >       <str>file:/C:/TIKA/FileTest/Test.txt</str>
> > > >     </arr>
> > > >     <arr name="Content-Encoding">
> > > >       <str>ISO-8859-1</str>
> > > >     </arr>
> > > >     <arr name="Content-Type">
> > > >       <str>text/plain; charset=ISO-8859-1</str>
> > > >     </arr>
> > > >   </lst>
> > > > </response>
> > > >
> > > > Can anyone shed some light here?
> > > > Thanks a lot.
> > > >
> > > > --
> > > > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
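
For anyone following the thread, here is a minimal Python sketch of how the extract-only request with Jan's extractFormat=text suggestion could be assembled. The core URL, file path and parameter names (extractOnly, extractFormat, stream.file) come from the messages above; the function name build_extract_url and its arguments are hypothetical and only for illustration. The sketch only builds the URL and sends nothing to Solr. As far as I understand the Solr documentation, extractFormat is only honoured together with extractOnly=true, which would match what Sergio saw once he dropped extractOnly.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Core URL as used in the thread; adjust for your own installation.
SOLR_CORE_URL = "http://localhost:8983/solr/document"


def build_extract_url(stream_file, extract_format="text", extra_params=None):
    """Build an extract-only URL for the /update/extract handler.

    Hypothetical helper: it only assembles and encodes the query string,
    so special characters in the Windows path are escaped properly.
    """
    params = {
        "extractOnly": "true",            # return the extraction, do not index
        "extractFormat": extract_format,  # "text" instead of the default XHTML
        "stream.file": stream_file,       # file path as seen by the Solr server
    }
    if extra_params:
        params.update(extra_params)
    return f"{SOLR_CORE_URL}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the encoded request URL for the test file from the thread.
    print(build_extract_url(r"C:\TIKA\FileTest\Test.txt"))
```

Encoding the query with urlencode avoids pasting a raw backslash path into the URL by hand, which is easy to get wrong when the link gets wrapped by a mail client.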