Shalin

I downloaded the nightly for 21 Jan and tried DIH again. It's better but
still broken. Dozens of embedded tags are now stripped from documents,
but the import fails every few documents for no reason I can see. Manually
removing the embedded tags from a given problem document lets it be indexed,
only to have the import fail on one of the next few documents. I think the
problem is still in stripHTML.

Here is the traceback.

Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3377 ms
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13
Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2
        commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
        commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232539612131
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: jc document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
        ... 9 more
Caused by: java.util.NoSuchElementException
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
        ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
        ... 9 more
Caused by: java.util.NoSuchElementException
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
        at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
        at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
        ... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
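
The innermost cause in the traceback is a java.util.NoSuchElementException
thrown from the StAX reader's next(). Under the javax.xml.stream contract,
next() throws that exception when it is called although hasNext() is false,
that is, after the end of the document has been reached. Whether that is what
actually happens inside XPathRecordReader here is not confirmed. The following
is only a minimal sketch, using the JDK's own StAX API outside Solr, for
checking a suspect document standalone. The class name StaxCheck, its method
names and the sample XML strings are made up for illustration.

```java
import java.io.StringReader;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Solr or DIH.
final class StaxCheck {

    // Opens a StAX reader over an in-memory XML string.
    private static XMLStreamReader open(String xml) throws XMLStreamException {
        return XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
    }

    // Returns true if the document can be read to its end with the usual
    // hasNext()/next() loop, false if the parser reports an XML error.
    static boolean isWellFormed(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    // Reads a well-formed document to its end, then calls next() once more.
    // Returns true if that extra call throws NoSuchElementException.
    static boolean nextPastEndThrows(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            while (reader.hasNext()) {
                reader.next();
            }
            try {
                reader.next(); // one call too many
                return false;
            } catch (NoSuchElementException e) {
                return true;
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input was not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Sample strings are invented; substitute the content of a failing file.
        System.out.println(isWellFormed("<record><title>t</title></record>"));
        System.out.println(isWellFormed("<record><para>text</record>"));
        System.out.println(nextPastEndThrows("<record/>"));
    }
}
```

If a file that DIH rejects passes isWellFormed, the document itself is
probably fine and the extra next() call is the more likely suspect. If it
fails, the XML needs fixing first.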



>Ah, it needs a null check for multi-valued fields. I've committed a fix to
>trunk. The next nightly build should have it. You can check out and build
>from the trunk if you need this immediately.
>
>On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>
>> Hmmm,
>>
>> Just to clarify I retested the thing using the nightly as of today
>> 18-jan-2009. The problem is still there and this traceback is from
>> that nightly.
>>
>> >>This looks fine. Can you post the stack trace?
>> >>
>> >Yep, here is the juicy bit. Let me know if you need more.
>> >
>> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>> >INFO: Server startup in 2390 ms
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>> >INFO: [janesdocs] webapp=/solr path=/dataimport
>> params={command=full-import} status=0 QTime=12
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> >INFO: Read dataimport.properties
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> >INFO: Starting Full Import
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> deleteAll
>> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>> >INFO: SolrDeletionPolicy.onInit: commits:num=2
>> >
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>> >
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>> >INFO: last commit = 1232363283059
>> >Jan 19, 2009 11:14:06 AM
>> org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>> >WARNING: transformer threw error
>> >java.lang.NullPointerException
>> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >       at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>> >SEVERE: Exception while processing: janescurrent document : null
>> >org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.NullPointerException
>> >       at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >       at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Caused by: java.lang.NullPointerException
>> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >       at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >       ... 9 more
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> >SEVERE: Full Import failed
>> >org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.NullPointerException
>> >       at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>> >       at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >       at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >       at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >       at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Caused by: java.lang.NullPointerException
>> >       at java.io.StringReader.<init>(StringReader.java:33)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >       at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >       at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >       ... 9 more
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> rollback
>> >INFO: start rollback
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> rollback
>> >INFO: end_rollback
>> >
>> >
>> >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie <fer...@twig.me.uk>
>> wrote:
>> >>
>> >>> Hello all,
>> >>>
>> >>> I have the following DIH data-config.xml file. Adding
>> >>> HTMLStripTransformer and the associated stripHTML on the
>> >>> para tag seems to have broken things. I am using a nightly
>> >>> build from 12-jan-2009.
>> >>>
>> >>> The /record/sect1/para contains HTML sub tags which need
>> >>> to be discarded. Is my use of stripHTML correct?
>> >>>
>> >>> <dataConfig>
>> >>>  <dataSource name="myfilereader" type="FileDataSource"/>
>> >>>  <document>
>> >>>     <entity name="jcurrent"
>> >>>        processor="FileListEntityProcessor"
>> >>>        fileName=".*xml"
>> >>>        newerThan="'NOW-1000DAYS'"
>> >>>        recursive="true"
>> >>>        rootEntity="false"
>> >>>        dataSource="null"
>> >>>        baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>> >>>
>> >>>        <entity name="x"
>> >>>           dataSource="myfilereader"
>> >>>           processor="XPathEntityProcessor"
>> >>>           url="${jcurrent.fileAbsolutePath}"
>> >>>           stream="false"
>> >>>           forEach="/record"
>> >>>
>> >>>
>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>> >>>
>> >>>           <field column="fileAbsPath"
>> >>> template="${jcurrent.fileAbsolutePath}" />
>> >>>           <field column="fileWebPath" regex="/Volumes/spare/ts/(.*)"
>> >>> replaceWith="$1" sourceColName="fileAbsPath"/>
>> >>>           <field column="title"    xpath="/record/title" />
>> >>>           <field column="para"     xpath="/record/sect1/para"
>> >>> stripHTML="true" />
>> >>>           <field column="subject"
>> >>>  xpath="/record/metadata/subje...@qualifier='fullTitle']"   />
>> >>>           <field column="pubname"
>> >>>  xpath="/record/metadata/subje...@qualifier='publication']" />
>> >>>           <field column="pubdate"
>> >>>  xpath="/record/metadata/da...@qualifier='pubDate']"
>> >>> dateTimeFormat="yyyyMMdd"   />
>> >>>           </entity>
>> >>>        </entity>
>> >>>     </document>
>> >>>  </dataConfig>
>> >>>
>> >>> --
>> >>>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===============================================================
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021

Unix/Mac/Intranets             Analyst Programmer
===============================================================
