Shalin,

I downloaded the nightly for 21-Jan and tried DIH again. It's better but still broken. Dozens of embedded tags are now stripped from documents, but the import fails every few documents for no reason I can see. Manually removing the embedded tags from a given problem document lets that document be indexed, only to have the import fail on one of the next few documents. I think the problem is still in stripHTML.
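One way to narrow this down outside DIH is to run each input file through a plain StAX parse. The traceback ends in `BasicStreamReader.next` throwing `NoSuchElementException`, and the StAX javadoc says `XMLStreamReader.next()` throws that exception when it is called after `hasNext()` has returned false. The following is only a minimal sketch under that reading, not DIH's own code: the class `StaxCheck` and the method `check` are hypothetical names, and it uses whatever StAX implementation the JDK provides rather than Woodstox.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Solr or DIH.
class StaxCheck {

    /** Returns null if the input parses to the end, otherwise the parser's error message. */
    public static String check(InputStream in) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // Guard every next() with hasNext(): calling next() past the end
                // of the document throws NoSuchElementException.
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return null;
        } catch (XMLStreamException e) {
            // Malformed input (for example a mismatched end tag) is reported here.
            return String.valueOf(e.getMessage());
        }
    }

    /** Usage: java StaxCheck file1.xml file2.xml ... */
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                String error = check(in);
                System.out.println(path + ": " + (error == null ? "ok" : error));
            }
        }
    }
}
```

If every file reports "ok" here, the input is probably well-formed and the failure is more likely in how the records are read than in the documents themselves.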
Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
        commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
        commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
        ... 9 more
Caused by: java.util.NoSuchElementException
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
        ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
        ... 9 more
Caused by: java.util.NoSuchElementException
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
        ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

>Ah, it needs a null check for multi valued fields. I've committed a fix to
>trunk. The next nightly build should have it. You can checkout and build
>from the trunk if need this immediately.
>
>On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>
>> Hmmm,
>>
>> Just to clarify I retested the thing using the nightly as of today
>> 18-jan-2009. The problem is still there and this traceback is from
>> that nightly.
>>
>> >>This looks fine. Can you post the stack trace?
>> >>
>> >Yep, here is the juicy bit. Let me know if you need more.
>> >
>> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>> >INFO: Server startup in 2390 ms
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>> >INFO: [janesdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=12
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
>> >INFO: Read dataimport.properties
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> >INFO: Starting Full Import
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
>> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>> >INFO: SolrDeletionPolicy.onInit: commits:num=2
>> >        commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>> >        commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
>> >INFO: last commit = 1232363283059
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>> >WARNING: transformer threw error
>> >java.lang.NullPointerException
>> >        at java.io.StringReader.<init>(StringReader.java:33)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >        at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument
>> >SEVERE: Exception while processing: janescurrent document : null
>> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>> >        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >        at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Caused by: java.lang.NullPointerException
>> >        at java.io.StringReader.<init>(StringReader.java:33)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >        at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >        ... 9 more
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
>> >SEVERE: Full Import failed
>> >org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
>> >        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >        at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Caused by: java.lang.NullPointerException
>> >        at java.io.StringReader.<init>(StringReader.java:33)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >        at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >        at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >        ... 9 more
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> >INFO: start rollback
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>> >INFO: end_rollback
>> >
>> >
>> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>> >>
>> >>> Hello all,
>> >>>
>> >>> I have the following DIH data-config.xml file. Adding
>> >>> HTMLStripTransformer and the associated stripHTML on the
>> >>> para tag seems to have broke things. I am using a nightly
>> >>> build from 12-jan-2009
>> >>>
>> >>> The /record/sect1/para contains HTML sub tags which need
>> >>> to be discarded. Is my use of stripHTML correct?
>> >>>
>> >>> <dataConfig>
>> >>>   <dataSource name="myfilereader" type="FileDataSource"/>
>> >>>   <document>
>> >>>     <entity name="jcurrent"
>> >>>             processor="FileListEntityProcessor"
>> >>>             fileName=".*xml"
>> >>>             newerThan="'NOW-1000DAYS'"
>> >>>             recursive="true"
>> >>>             rootEntity="false"
>> >>>             dataSource="null"
>> >>>             baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>> >>>
>> >>>       <entity name="x"
>> >>>               dataSource="myfilereader"
>> >>>               processor="XPathEntityProcessor"
>> >>>               url="${jcurrent.fileAbsolutePath}"
>> >>>               stream="false"
>> >>>               forEach="/record"
>> >>>               transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>> >>>
>> >>>         <field column="fileAbsPath"
>> >>>                template="${jcurrent.fileAbsolutePath}" />
>> >>>         <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
>> >>>                replaceWith="$1" sourceColName="fileAbsePath"/>
>> >>>         <field column="title" xpath="/record/title" />
>> >>>         <field column="para" xpath="/record/sect1/para"
>> >>>                stripHTML="true" />
>> >>>         <field column="subject"
>> >>>                xpath="/record/metadata/subje...@qualifier='fullTitle']" />
>> >>>         <field column="pubname"
>> >>>                xpath="/record/metadata/subje...@qualifier='publication']" />
>> >>>         <field column="pubdate"
>> >>>                xpath="/record/metadata/da...@qualifier='pubDate']"
>> >>>                dateTimeFormat="yyyyMMdd" />
>> >>>       </entity>
>> >>>     </entity>
>> >>>   </document>
>> >>> </dataConfig>
>> >>>
>> >>> --
>> >>>
>--
>Regards,
>Shalin Shekhar Mangar.

--

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
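
On the "null check for multi valued fields" mentioned in Shalin's quoted reply: the quoted NullPointerException is raised in the `StringReader` constructor, which is what happens when a null value is passed in. A minimal sketch of such a guard is given here purely as an illustration; it is not the fix committed to trunk. `NullSafeStrip`, `stripValue` and `stripTags` are hypothetical names, and the regex is a crude stand-in for real HTML stripping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the HTMLStripTransformer source.
class NullSafeStrip {

    /** Crude stand-in for HTML stripping: removes anything that looks like a tag. */
    public static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    /**
     * Strips tags from a single value or from every item of a multi-valued
     * field (a List), passing nulls through instead of dereferencing them.
     */
    public static Object stripValue(Object value) {
        if (value == null) {
            return null; // field was absent in this record
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                // A multi-valued field may contain null entries; keep them as-is.
                out.add(item == null ? null : stripTags(item.toString()));
            }
            return out;
        }
        return stripTags(value.toString());
    }
}
```

With this shape, a record whose para field is missing, or whose list of para values contains a null entry, is passed through rather than aborting the whole import.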