Hey, XPathEntityProcessor does not work with wildcard XPath expressions like '//a...@class'.
If you just wish to index HTML, use a PlainTextEntityProcessor with HTMLStripTransformer.

On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen <daniel.michael.co...@gmail.com> wrote:
> Hi there,
>
> I'm trying to get the DataImportHandler working to recursively parse the
> content of a root directory, which contains several other directories beneath
> it... The indexing seems to encounter errors with the DOCTYPE tag in my
> source files.
>
> I've provided my schema.xml with the appropriate fields, and I've added the
> dataimport requestHandler to solrconfig.xml. Does anyone know what I am
> doing wrong, or perhaps a better way to attempt this?
>
> dataconfig.xml:
>
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="file"
>             processor="FileListEntityProcessor"
>             baseDir="exampledocs/dylan"
>             fileName=".*htm"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null">
>       <entity name="song"
>               processor="XPathEntityProcessor"
>               forEach="/html"
>               transformer="HTMLStripTransformer"
>               url="${file.fileAbsolutePath}">
>         <field column="name" xpath="//h...@class='songtitle']"/>
>         <field column="album" xpath="//a...@class='recordlink']"/>
>         <field column="body" xpath="//body" stripHTML="true" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> Stack trace:
>
> java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>     at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:163)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:311)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
> Caused by: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>     at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>     at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>     at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>     ... 10 more
> Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
>     at java.net.URL.openStream(URL.java:1007)
>     at com.ctc.wstx.util.URLUtil.optimizedStreamFromURL(URLUtil.java:113)
>     at com.ctc.wstx.io.DefaultInputResolver.sourceFromURL(DefaultInputResolver.java:256)
>     at com.ctc.wstx.io.DefaultInputResolver.resolveEntity(DefaultInputResolver.java:96)
>     at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:468)
>     at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
>     at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
>     at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
>     ... 13 more
>
> Sample .htm file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
> <title>Hazel</title>
> <link rel="stylesheet" type="text/css" href="../css/general.css" />
> </head>
>
> <body>
>
> <h1 class="songtitle">Hazel</h1>
>
> <p>Words and music Bob Dylan<br />
> Released on <a class="recordlink" href="index.htm">Planet Waves</a> (1974)<br />
> Tabbed by Eyolf Østrem</p>
>
> <p>The song could equally well be played with C chords and a capo on the
> 4th fret. Such a version is appended at the end.</p>
>
> <p>The intro is played rather freely (which is a nice way of saying that
> they aren't exactly tight...) – and with both a bass and a guitar. The
> tab below is just a suggestion of an approximation.</p>
>
> <hr />
>
> <pre class="tab">
> E B A E/g# F#m E
> |--------7-------|--------2-------|----------------|
> |9-----9---9-----|4-----4---4-----|2---0-----------|
> |9---9-------9---|4---4-------4---|2---1---2---1---|
> |9---------------|4---------------|2---2---4---2---|
> |7---------------|2---------------|0-------4---2---|
> |----------------|----------------|----4---2---0---|
> </pre>
>
> <pre class="verse">
> E G#
> Hazel, dirty-blonde hair
> A F#7
> I wouldn't be ashamed to be seen with you anywhere.
> E G# C#m E/b A
> You got something I want plenty of
> E B A G#m F#m E
> Ooh, a little touch of your love.
>
> Hazel, stardust in your eye
> You're goin' somewhere and so am I.
> I'd give you the sky high above
> Ooh, for a little touch of your love.
> </pre>
>
> <pre class="bridge">
> G# C#m
> Oh no, I don't need any reminder
> G# C#m
> To know how much I really care
> F#
> But it's just making me blinder and blinder
> B A G#m F#m
> Because I'm up on a hill and still you're not there.
> </pre>
>
> <pre class="verse">
> Hazel, you called and I came,
> Now don't make me play this waiting game.
> You've got something I want plenty of
> Ooh, a little touch of your love.
> </pre>
>
> <hr />
>
> <h2 class="songversion">Version with capo on 4th fret</h2>
> <pre class="verse">
> C E
> Hazel, dirty-blonde hair
> F D7
> I wouldn't be ashamed to be seen with you anywhere.
> C E Am /g F
> You got something I want plenty of
> C G F Em Dm C
> Ooh, a little touch of your love.
>
> ...
> </pre>
>
> <pre class="bridge">
> E Am
> Oh no, I don't need any reminder
> E Am
> To know how much I really care
>
> But it's just making me blinder and blinder
> G F Em Dm
> Because I'm up on a hill and still you're not there.
> </pre>
> </body></html>

--
-----------------------------------------------------
Noble Paul | Principal Engineer | AOL | http://aol.com
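
P.S. A minimal sketch of the PlainTextEntityProcessor suggestion from the top of this reply, reusing the outer FileListEntityProcessor entity from the quoted config. It assumes the schema has a "body" field (the name is taken from the original dataconfig.xml) and that PlainTextEntityProcessor exposes the whole file as a single column named plainText. It is untested against this data set, so treat it as a starting point rather than a known-good config. Note that it indexes only the stripped text of each file and does not extract the song title or album separately.

```xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- Outer entity unchanged: walks baseDir recursively and lists matching files -->
    <entity name="file"
            processor="FileListEntityProcessor"
            baseDir="exampledocs/dylan"
            fileName=".*htm"
            recursive="true"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity reads each file as raw text, so no XML parsing or DTD lookup is involved -->
      <entity name="song"
              processor="PlainTextEntityProcessor"
              transformer="HTMLStripTransformer"
              url="${file.fileAbsolutePath}">
        <!-- "body" is assumed to exist in schema.xml; stripHTML removes the markup -->
        <field column="plainText" name="body" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```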