Re: ExtractRequestHandler and Tika. Get only plain text

Sergio García Maroto Wed, 14 Nov 2018 07:44:05 -0800

Thanks Erick.
I do use this strategy for indexing data from the DB. It is very flexible for
me.
I work in a company where .NET is the main dev platform, so it is even more
important to separate things.


Does your post mean that the functionality for indexing documents in Solr
using ExtractRequestHandler doesn't provide the option of indexing plain text?
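
For reference, here is a minimal sketch of the two requests discussed in this thread, built with Python's standard library so that the Windows path and other values are URL-encoded properly. The base URL, file path, document id, and field name are copied from the messages below; the helper name build_extract_url is hypothetical. Note that the Solr Reference Guide describes extractFormat as applying only when extractOnly=true, so it is left out of the indexing request here; verify that against your Solr version.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL taken from the thread; adjust host and collection for your setup.
BASE = "http://localhost:8983/solr/document/update/extract"


def build_extract_url(params):
    """Return the extract-handler URL with properly encoded query parameters.

    build_extract_url is a hypothetical helper name used only in this sketch.
    """
    return BASE + "?" + urlencode(params)


# Preview request: extract only, asking for plain text instead of XHTML.
preview_url = build_extract_url({
    "extractOnly": "true",
    "extractFormat": "text",
    "stream.file": r"C:\TIKA\FileTest\Test.txt",
})

# Indexing request: same file, mapped into the DocContentS field.
index_url = build_extract_url({
    "literal.id": "DDOC001",
    "stream.file": r"C:\TIKA\FileTest\Test.txt",
    "fmap.content": "DocContentS",
    "commit": "true",
})

print(preview_url)
print(index_url)
```

Building the query string with urlencode avoids hand-editing backslashes and colons in the stream.file value, which is easy to get wrong when pasting URLs into a browser or an email.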

On Wed, 14 Nov 2018 at 16:14, Erick Erickson <[email protected]>
wrote:

> While ERH is fine for getting started, as you go toward production
> you'll want to consider parsing the data outside of Solr for the
> reasons (and example) outlined here:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
> On Wed, Nov 14, 2018 at 6:46 AM Sergio García Maroto <[email protected]>
> wrote:
> >
> > Thanks a lot Jan.
> > That works very well.
> >
> > I am now trying to index the doc in Solr by removing the extractOnly
> > parameter, and I can't find any similar option to get the data indexed
> > as plain text. I am getting the metadata as well.
> > This is my request.
> >
> > http://localhost:8983/solr/document/update/extract?literal.id=DDOC001&stream.file=C:\TIKA\FileTest\Test.txt&commit=true&fmap.content=DocContentS
> >
> > My DocContentS contains:
> > "\n \n stream_size 13 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
> > X-Parsed-By org.apache.tika.parser.txt.TXTParser \n stream_name Test.txt
> > \n stream_source_info file:/C:/TIKA/FileTest/Test.txt \n Content-Encoding
> > ISO-8859-1 \n Content-Type text/plain; charset=ISO-8859-1 \n \n \n Prueba
> > Sergio \n "
> >
> > I can't find anywhere how to modify this behaviour.
> >
> >
> >
> >
> > On Wed, 14 Nov 2018 at 13:06, Jan Høydahl <[email protected]> wrote:
> >
> > > Have you tried specifying &extractFormat=text?
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > >
> > > > On 14 Nov 2018, at 12:09, marotosg <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Currently I am trying to index documents of different kinds with Solr
> > > > and Tika. It's working fine, but when Solr returns the content of the
> > > > document, it doesn't return just the plain text. It also comes back
> > > > with some metadata.
> > > >
> > > > For instance, this is my request:
> > > >
> > > > http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> > > >
> > > > Content of Test.txt file is just "*Test File*".
> > > >
> > > > The response from Solr, as you can see below, returns plenty of
> > > > information. I would like the answer to be something like this, without
> > > > the noise, for the search:
> > > > <str name="Test.txt">
> > > > Test File
> > > > </str>
> > > >
> > > > <response>
> > > > <lst name="responseHeader">
> > > > <int name="status">0</int>
> > > > <int name="QTime">135</int>
> > > > </lst>
> > > > <str name="Test.txt">
> > > > <?xml version="1.0" encoding="UTF-8"?> <html
> > > > xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="stream_size"
> > > > content="13"/> <meta name="X-Parsed-By"
> > > > content="org.apache.tika.parser.DefaultParser"/> <meta name="X-Parsed-By"
> > > > content="org.apache.tika.parser.txt.TXTParser"/> <meta name="stream_name"
> > > > content="Test.txt"/> <meta name="stream_source_info"
> > > > content="file:/C:/TIKA/FileTest/Test.txt"/> <meta name="Content-Encoding"
> > > > content="ISO-8859-1"/> <meta name="Content-Type" content="text/plain;
> > > > charset=ISO-8859-1"/> <title></title> </head> <body> <p>Test File</p>
> > > > </body> </html>
> > > > </str>
> > > > <lst name="Test.txt_metadata">
> > > > <arr name="stream_size">
> > > > <str>13</str>
> > > > </arr>
> > > > <arr name="X-Parsed-By">
> > > > <str>org.apache.tika.parser.DefaultParser</str>
> > > > <str>org.apache.tika.parser.txt.TXTParser</str>
> > > > </arr>
> > > > <arr name="stream_name">
> > > > <str>Test.txt</str>
> > > > </arr>
> > > > <arr name="stream_source_info">
> > > > <str>file:/C:/TIKA/FileTest/Test.txt</str>
> > > > </arr>
> > > > <arr name="Content-Encoding">
> > > > <str>ISO-8859-1</str>
> > > > </arr>
> > > > <arr name="Content-Type">
> > > > <str>text/plain; charset=ISO-8859-1</str>
> > > > </arr>
> > > > </lst>
> > > > </response>
> > > >
> > > > Can anyone shed some light on this?
> > > > Thanks a lot.
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> > >
> > >
>
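
As an illustration of the "parse outside Solr" advice in this thread, the XHTML returned by an extractOnly request can be reduced to its body text on the client side before indexing. The following is only a sketch under stated assumptions: the sample string is a trimmed copy of the response quoted above, the helper name body_text is hypothetical, and real Tika output for richer formats (PDF, Word) may need more careful handling than collapsing whitespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML that Tika produces (seen in the quoted response).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_text(xhtml: str) -> str:
    """Return only the text inside <body>, dropping the <head> metadata.

    body_text is a hypothetical helper name used only in this sketch.
    """
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    # itertext() yields every text node under <body>;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join(" ".join(body.itertext()).split())


# Trimmed copy of the XHTML from the response quoted in this thread.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?> '
    '<html xmlns="http://www.w3.org/1999/xhtml"> <head> '
    '<meta name="stream_size" content="13"/> '
    '<meta name="Content-Type" content="text/plain; charset=ISO-8859-1"/> '
    '<title></title> </head> <body> <p>Test File</p> </body> </html>'
)

print(body_text(sample))  # prints: Test File
```

The cleaned string could then be sent to Solr as an ordinary field value through a regular update request, which keeps the metadata out of the searchable content field.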
