Sorry for my late reply. Thanks for sharing; yes, this is possible.
Maybe my last mail was confusing. I hope the examples below help.

Alternative 1 - Use only DIH, without an update processor

tika-data-config-2.xml
- add the transformer to the entity and the transformation to the field (done here for id and for fulltext)
- additionally set the TikaEntityProcessor format to "text":

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="d:\normalized\webcontent\bibleforchildren.org"
            fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)|(pptx)|(xls)|(xlsx)|(txt)|(htm)|(html)"
            onError="skip" recursive="true"
            transformer="RegexTransformer">
      <field column="fileAbsolutePath" name="id" regex="[^\w|\.]" replaceWith=""/>
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              transformer="RegexTransformer">
        <field column="file" name="fileName"/>
        <field column="description" name="description" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="mime_type" name="type" meta="true"/>
        <field column="text" name="fulltext" regex="\n|\r" replaceWith=" "/>
        <field column="keywords" name="keywords" meta="true"/>
        <field column="count" name="page_count" meta="true"/>
        <field column="dc:terms" name="keywords_alt" meta="true"/>
        <field column="Content-Type" name="content_type" meta="true"/>
        <field column="xmpTPg:NPages" name="page_count_alt" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Alternative 2 - Regex processor in solrconfig.xml
- you need to put everything into ONE chain

<updateRequestProcessorChain name="my-chain">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">_text_</str>
    <str name="fieldName">fulltext</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">_text_</str>
    <str name="fieldName">fulltext</str>
    <str name="pattern">\n|\r</str>
    <str name="replacement"/>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">id</str>
    <str name="fieldName">url</str>
    <str name="pattern">[^\w|\.]</str>
    <str name="replacement">/</str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

[..]

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">tika-data-config-2.xml</str>
    <str name="update.chain">my-chain</str>
  </lst>
</requestHandler>

On Thu, Mar 14, 2019 at 6:41 AM wclarke <wcla...@widernet.org> wrote:

> Got each one working individually, but not multiples. Is it possible?
> Please see attached files.
>
> Thanks!!! tika-data-config-2.xml
> <http://lucene.472066.n3.nabble.com/file/t494707/tika-data-config-2.xml>
> solrconfig.xml
> <http://lucene.472066.n3.nabble.com/file/t494707/solrconfig.xml>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
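
P.S. The two patterns used in both alternatives can be tried outside Solr before reindexing. The following is only a rough sketch: it assumes that the DIH RegexTransformer and solr.RegexReplaceProcessorFactory (with literalReplacement=true) both end up behaving like a plain java.util.regex replaceAll with a literal replacement string. The class name RegexPreview, the helper methods, and the sample file name "My File.pdf" are made up for illustration and are not part of Solr or of the attached configs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPreview {
    // Same pattern as in the configs: every character that is not a
    // word character, a pipe or a dot.
    static final Pattern ID_CHARS = Pattern.compile("[^\\w|\\.]");
    // Same pattern as in the configs: a single line feed or carriage return.
    static final Pattern LINE_BREAKS = Pattern.compile("\\n|\\r");

    // Hypothetical helper: replaces each matching character in a path.
    static String cleanId(String path, String replacement) {
        // quoteReplacement makes the replacement literal (assumed to mirror
        // literalReplacement=true in the update processor).
        return ID_CHARS.matcher(path).replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Hypothetical helper: replaces each line break character in the text.
    static String flattenText(String text, String replacement) {
        return LINE_BREAKS.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Sample path under the baseDir from the config; the file name is invented.
        String path = "d:\\normalized\\webcontent\\bibleforchildren.org\\My File.pdf";

        // Alternative 1 uses replaceWith="" for the id.
        System.out.println(cleanId(path, ""));
        // Alternative 2 uses "/" as the replacement for id and url.
        System.out.println(cleanId(path, "/"));

        // Alternative 1 replaces line breaks with a space, Alternative 2 with nothing.
        System.out.println(flattenText("line one\r\nline two", " "));
        System.out.println(flattenText("line one\r\nline two", ""));
    }
}
```

Note that the two alternatives are not identical in effect: Alternative 1 drops the special characters from the id and turns each line break into a space, while Alternative 2 replaces the special characters with "/" and removes line breaks entirely. A Windows "\r\n" pair counts as two matches, so it becomes two spaces in the first case.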