>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing that is missing?

><field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
><field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
>sourceColName="*fileAbsePath*"/>
>
>I guess "fileAbsePath" is a typo? Can you check if that is the cause?

Well spotted. I had made a mess of sanitizing the config file I sent to you.
In future I will make sure the stuff I am messing with matches what I send
to the list. However, there is no typo in the underlying file; at least not
on that line :-)

>
>
>On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>
>> Shalin
>>
>> Downloaded nightly for 21jan and tried DIH again. It's better but
>> still broken. Dozens of embedded tags are stripped from documents
>> but it now fails every few documents for no reason I can see. Manually
>> removing embedded tags causes a given problem document to be indexed,
>> only to have it fail on one of the next few documents. I think the
>> problem is still in stripHTML
>>
>> Here is the traceback.
>>
>> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
>> INFO: Server startup in 3377 ms
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
>> INFO: Read dataimport.properties
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
>> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> INFO: Starting Full Import
>> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
>> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
>> INFO: SolrDeletionPolicy.onInit: commits:num=2
>>     commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>>     commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
>> INFO: last commit = 1232539612131
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> SEVERE: Exception while processing: jc document : null
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>     ... 9 more
>> Caused by: java.util.NoSuchElementException
>>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>     ... 10 more
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> SEVERE: Full Import failed
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>     ... 9 more
>> Caused by: java.util.NoSuchElementException
>>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>     ... 10 more
>> Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
>> INFO: start rollback
>>
>>
>>
>> >Ah, it needs a null check for multi valued fields. I've committed a fix to
>> >trunk. The next nightly build should have it. You can checkout and build
>> >from the trunk if you need this immediately.
>> >
>> >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>> >
>> >> Hmmm,
>> >>
>> >> Just to clarify I retested the thing using the nightly as of today
>> >> 18-jan-2009. The problem is still there and this traceback is from
>> >> that nightly.
>> >>
>> >> >>This looks fine. Can you post the stack trace?
>> >> >>
>> >> >Yep, here is the juicy bit. Let me know if you need more.
>> >> >
>> >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>> >> >INFO: Server startup in 2390 ms
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>> >> >INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
>> >> >INFO: Read dataimport.properties
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> >> >INFO: Starting Full Import
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
>> >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>> >> >INFO: SolrDeletionPolicy.onInit: commits:num=2
>> >> >    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>> >> >    commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
>> >> >INFO: last commit = 1232363283059
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>> >> >WARNING: transformer threw error
>> >> >java.lang.NullPointerException
>> >> >    at java.io.StringReader.<init>(StringReader.java:33)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> >> >SEVERE: Exception while processing: janescurrent document : null
>> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>> >> >    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Caused by: java.lang.NullPointerException
>> >> >    at java.io.StringReader.<init>(StringReader.java:33)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >    ... 9 more
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> >> >SEVERE: Full Import failed
>> >> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>> >> >    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >    at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Caused by: java.lang.NullPointerException
>> >> >    at java.io.StringReader.<init>(StringReader.java:33)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >    at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >    at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >    ... 9 more
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> >> >INFO: start rollback
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> >> >INFO: end_rollback
>> >> >
>> >> >
>> >> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>> >> >>
>> >> >>> Hello all,
>> >> >>>
>> >> >>> I have the following DIH data-config.xml file. Adding
>> >> >>> HTMLStripTransformer and the associated stripHTML on the
>> >> >>> para tag seems to have broken things. I am using a nightly
>> >> >>> build from 12-jan-2009
>> >> >>>
>> >> >>> The /record/sect1/para contains HTML sub tags which need
>> >> >>> to be discarded. Is my use of stripHTML correct?
>> >> >>>
>> >> >>> <dataConfig>
>> >> >>>   <dataSource name="myfilereader" type="FileDataSource"/>
>> >> >>>   <document>
>> >> >>>     <entity name="jcurrent"
>> >> >>>             processor="FileListEntityProcessor"
>> >> >>>             fileName=".*xml"
>> >> >>>             newerThan="'NOW-1000DAYS'"
>> >> >>>             recursive="true"
>> >> >>>             rootEntity="false"
>> >> >>>             dataSource="null"
>> >> >>>             baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>> >> >>>
>> >> >>>       <entity name="x"
>> >> >>>               dataSource="myfilereader"
>> >> >>>               processor="XPathEntityProcessor"
>> >> >>>               url="${jcurrent.fileAbsolutePath}"
>> >> >>>               stream="false"
>> >> >>>               forEach="/record"
>> >> >>>               transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>> >> >>>
>> >> >>>         <field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
>> >> >>>         <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1" sourceColName="fileAbsePath"/>
>> >> >>>         <field column="title" xpath="/record/title" />
>> >> >>>         <field column="para" xpath="/record/sect1/para" stripHTML="true" />
>> >> >>>         <field column="subject" xpath="/record/metadata/subje...@qualifier='fullTitle']" />
>> >> >>>         <field column="pubname" xpath="/record/metadata/subje...@qualifier='publication']" />
>> >> >>>         <field column="pubdate" xpath="/record/metadata/da...@qualifier='pubDate']" dateTimeFormat="yyyyMMdd" />
>> >> >>>       </entity>
>> >> >>>     </entity>
>> >> >>>   </document>
>> >> >>> </dataConfig>
>> >> >>>
>> >> >>> --
>> >> >>>

--
===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
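
For anyone reading this thread later: the NullPointerException in the Jan 19 traceback comes from stripHTML being handed a null value, and the reply above mentions a null check for multi valued fields. The following is only a rough Python sketch of that idea, not the Java patch that was committed to trunk. The names strip_html and transform_row are hypothetical, the regex is a crude stand-in for Solr's real HTML stripping, and dropping null entries from a multi-valued column is an assumption made for the illustration.

```python
import re

# Crude tag matcher; a stand-in for real HTML stripping (assumption).
TAG = re.compile(r"<[^>]+>")


def strip_html(value):
    """Remove anything that looks like a tag from a single string."""
    return TAG.sub("", value)


def transform_row(row, strip_columns):
    """Return a copy of row with tags stripped from the named columns.

    A column may be absent, a single string, or a list of strings
    (a multi-valued field) that can itself contain None entries.
    """
    out = dict(row)
    for col in strip_columns:
        value = row.get(col)
        if value is None:
            # Column missing from this record: nothing to strip.
            continue
        if isinstance(value, list):
            # Multi-valued field: skip null entries instead of failing.
            out[col] = [strip_html(v) for v in value if v is not None]
        else:
            out[col] = strip_html(value)
    return out


if __name__ == "__main__":
    row = {"title": "t", "para": ["a <b>bold</b> word", None, "plain"]}
    print(transform_row(row, ["para"]))
```

Without the two None checks, the first record with an empty or missing para would abort the whole import, which matches the "fails on the first such document" behaviour reported above.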