>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing
that is missing?

><field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
><field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
>sourceColName="*fileAbsePath*"/>
>
>I guess "fileAbsePath" is a typo? Can you check if that is the cause?

Well spotted. I made a mess of sanitizing the config file I sent
to you. In future I will make sure that what I send to the list matches
what I am actually running. However, there is no typo in the underlying
file; at least not on that line :-)
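
For reference, a minimal sketch of how that pair of fields reads when the
names agree. It reuses only the attributes from the snippet quoted above.
The point is that sourceColName on the second field should carry the same
spelling as the column of the first ("fileAbsPath"); with a mismatched name
RegexTransformer would presumably find no source value to work on. Treat it
as an illustration, not a copy of the real config file:

```xml
<!-- TemplateTransformer fills fileAbsPath from the outer entity's file path -->
<field column="fileAbsPath" template="${jcurrent.fileAbsolutePath}" />
<!-- RegexTransformer reads fileAbsPath (same spelling as above) and drops the local prefix -->
<field column="fileWebPath" regex="/Volumes/spare/ts/(.*)" replaceWith="$1"
       sourceColName="fileAbsPath" />
```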


>
>
>On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>
>> Shalin
>>
>> Downloaded nightly for 21jan and tried DIH again. It's better but
>> still broken. Dozens of embedded tags are stripped from documents
>> but it now fails every few documents for no reason I can see. Manually
>> removing embedded tags causes a given problem document to be indexed,
>> only to have it fail on one of the next few documents. I think the
>> problem is still in stripHTML.
>>
>> Here is the traceback.
>>
>> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
>> INFO: Server startup in 3377 ms
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> INFO: Read dataimport.properties
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
>> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import}
>> status=0 QTime=13
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> INFO: Starting Full Import
>> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2
>> deleteAll
>> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
>> INFO: SolrDeletionPolicy.onInit: commits:num=2
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>> INFO: last commit = 1232539612131
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>> SEVERE: Exception while processing: jc document : null
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
>> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>        at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>>         at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>        ... 9 more
>> Caused by: java.util.NoSuchElementException
>>        at
>> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>        ... 10 more
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> SEVERE: Full Import failed
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
>> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>        at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>>         at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>        at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>        ... 9 more
>> Caused by: java.util.NoSuchElementException
>>        at
>> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>        at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>>        ... 10 more
>> Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2
>> rollback
>> INFO: start rollback
>>
>>
>>
>> >Ah, it needs a null check for multi-valued fields. I've committed a fix to
>> >trunk. The next nightly build should have it. You can check out and build
>> >from the trunk if you need this immediately.
>> >
>> >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk>
>> wrote:
>> >
>> >> Hmmm,
>> >>
>> >> Just to clarify I retested the thing using the nightly as of today
>> >> 18-jan-2009. The problem is still there and this traceback is from
>> >> that nightly.
>> >>
>> >> >>This looks fine. Can you post the stack trace?
>> >> >>
>> >> >Yep, here is the juicy bit. Let me know if you need more.
>> >> >
>> >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>> >> >INFO: Server startup in 2390 ms
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>> >> >INFO: [janesdocs] webapp=/solr path=/dataimport
>> >> params={command=full-import} status=0 QTime=12
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter
>> >> readIndexerProperties
>> >> >INFO: Read dataimport.properties
>> >> >Jan 19, 2009 11:14:06 AM
>> org.apache.solr.handler.dataimport.DataImporter
>> >> doFullImport
>> >> >INFO: Starting Full Import
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> >> deleteAll
>> >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>> >> >INFO: SolrDeletionPolicy.onInit: commits:num=2
>> >> >
>> >>
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>> >> >
>> >>
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy
>> >> updateCommits
>> >> >INFO: last commit = 1232363283059
>> >> >Jan 19, 2009 11:14:06 AM
>> >> org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>> >> >WARNING: transformer threw error
>> >> >java.lang.NullPointerException
>> >> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder
>> >> buildDocument
>> >> >SEVERE: Exception while processing: janescurrent document : null
>> >> >org.apache.solr.handler.dataimport.DataImportHandlerException:
>> >> java.lang.NullPointerException
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Caused by: java.lang.NullPointerException
>> >> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >       ... 9 more
>> >> >Jan 19, 2009 11:14:06 AM
>> org.apache.solr.handler.dataimport.DataImporter
>> >> doFullImport
>> >> >SEVERE: Full Import failed
>> >> >org.apache.solr.handler.dataimport.DataImportHandlerException:
>> >> java.lang.NullPointerException
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >> >Caused by: java.lang.NullPointerException
>> >> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >> >       at
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >> >       ... 9 more
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> >> rollback
>> >> >INFO: start rollback
>> >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> >> rollback
>> >> >INFO: end_rollback
>> >> >
>> >> >
>> >> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk>
>> >> wrote:
>> >> >>
>> >> >>> Hello all,
>> >> >>>
>> >> >>> I have the following DIH data-config.xml file. Adding
>> >> >>> HTMLStripTransformer and the associated stripHTML on the
>> >> >>> para tag seems to have broken things. I am using a nightly
>> >> >>> build from 12-jan-2009.
>> >> >>>
>> >> >>> The /record/sect1/para contains HTML sub tags which need
>> >> >>> to be discarded. Is my use of stripHTML correct?
>> >> >>>
>> >> >>> <dataConfig>
>> >> >>>  <dataSource name="myfilereader" type="FileDataSource"/>
>> >> >>>  <document>
>> >> >>>     <entity name="jcurrent"
>> >> >>>        processor="FileListEntityProcessor"
>> >> >>>        fileName=".*xml"
>> >> >>>        newerThan="'NOW-1000DAYS'"
>> >> >>>        recursive="true"
>> >> >>>        rootEntity="false"
>> >> >>>        dataSource="null"
>> >> >>>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>> >> >>>
>> >> >>>        <entity name="x"
>> >> >>>           dataSource="myfilereader"
>> >> >>>           processor="XPathEntityProcessor"
>> >> >>>           url="${jcurrent.fileAbsolutePath}"
>> >> >>>           stream="false"
>> >> >>>           forEach="/record"
>> >> >>>
>> >> >>>
>> >>
>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>> >> >>>
>> >> >>>           <field column="fileAbsPath"
>> >> >>> template="${jcurrent.fileAbsolutePath}" />
>> >> >>>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
>> >> >>> replaceWith="$1" sourceColName="fileAbsePath"/>
>> >> >>>           <field column="title"    xpath="/record/title" />
>> >> >>>           <field column="para"     xpath="/record/sect1/para"
>> >> >>> stripHTML="true" />
>> >> >>>           <field column="subject"
>> >> >>>  xpath="/record/metadata/subje...@qualifier='fullTitle']"   />
>> >> >>>           <field column="pubname"
>> >> >>>  xpath="/record/metadata/subje...@qualifier='publication']" />
>> >> >>>           <field column="pubdate"
>> >> >>>  xpath="/record/metadata/da...@qualifier='pubDate']"
>> >> >>> dateTimeFormat="yyyyMMdd"   />
>> >> >>>           </entity>
>> >> >>>        </entity>
>> >> >>>     </document>
>> >> >>>  </dataConfig>
>> >> >>>
>> >> >>> --
>> >> >>>
-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
