Hey, XPathEntityProcessor does not work with wildcard XPath expressions like '//a...@class'.

If you just wish to index HTML, use a PlainTextEntityProcessor with
HTMLStripTransformer.
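
A rough, untested sketch of what that could look like, reusing the entity
names and paths from your config. As far as I know, PlainTextEntityProcessor
reads the whole file into a single implicit column called plainText; mapping
that column to a schema field named "body" is an assumption taken from your
original config. Because the file is read as plain text rather than parsed as
XML, the DOCTYPE should not be resolved at all.

```xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- Outer entity unchanged: walks the directory tree and lists *.htm files -->
    <entity name="file"
            processor="FileListEntityProcessor"
            baseDir="exampledocs/dylan"
            fileName=".*htm"
            recursive="true"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity reads each file as one block of text instead of parsing it as XML -->
      <entity name="song"
              processor="PlainTextEntityProcessor"
              transformer="HTMLStripTransformer"
              url="${file.fileAbsolutePath}">
        <!-- "plainText" is the implicit column; "body" is assumed to exist in schema.xml -->
        <field column="plainText" name="body" stripHTML="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that this only gives you the stripped page text in one field; the
separate "name" and "album" fields would have to be filled some other way.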

On Fri, Sep 11, 2009 at 1:22 AM, Daniel Cohen
<daniel.michael.co...@gmail.com> wrote:
> Hi there,
>
> I'm trying to get the DataImportHandler working to recursively parse the
> content of a root directory, which contains several other directories beneath
> it. The indexing seems to encounter errors with the doctype tag in my
> source files.
>
> I've provided my schema.xml with the appropriate fields, and I've added the
> dataimport requestHandler to solrconfig.xml. Does anyone know what I am
> doing wrong, or perhaps a better way to attempt this?
>
> dataconfig.xml:
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="file"
>             processor="FileListEntityProcessor"
>             baseDir="exampledocs/dylan"
>             fileName=".*htm"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null">
>       <entity name="song"
>               processor="XPathEntityProcessor"
>               forEach="/html"
>               transformer="HTMLStripTransformer"
>               url="${file.fileAbsolutePath}">
>         <field column="name" xpath="//h...@class='songtitle']"/>
>         <field column="album" xpath="//a...@class='recordlink']"/>
>         <field column="body" xpath="//body" stripHTML="true" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
>
> Stack trace:
>
> java.lang.RuntimeException: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>   at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>   at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:226)
>   at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>   at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:163)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>   at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:311)
>   at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
>   at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
>   at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>   at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
>   at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
> Caused by: com.ctc.wstx.exc.WstxIOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>   at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
>   at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
>   at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>   at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>   at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>   ... 10 more
> Caused by: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
>   at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
>   at java.net.URL.openStream(URL.java:1007)
>   at com.ctc.wstx.util.URLUtil.optimizedStreamFromURL(URLUtil.java:113)
>   at com.ctc.wstx.io.DefaultInputResolver.sourceFromURL(DefaultInputResolver.java:256)
>   at com.ctc.wstx.io.DefaultInputResolver.resolveEntity(DefaultInputResolver.java:96)
>   at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:468)
>   at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
>   at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
>   at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
>   at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
>   ... 13 more
>
>
> Sample .htm file:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
>
> <head>
> <title>Hazel</title>
> <link rel="stylesheet" type="text/css" href="../css/general.css" />
> </head>
>
> <body>
>
> <h1 class="songtitle">Hazel</h1>
>
>
> <p>Words and music Bob Dylan<br />
> Released on <a class="recordlink" href="index.htm">Planet Waves</a>
> (1974)<br />
> Tabbed by Eyolf &Oslash;strem</p>
>
> <p>The song could equally well be played with C chords and a capo on the
> 4th fret. Such a version is appended at the end.</p>
>
> <p>The intro is played rather freely (which is a nice way of saying that
> they aren't exactly tight...) &ndash; and with both a bass and a guitar. The
> tab below is just a suggestion of an approximation.</p>
>
> <hr />
>
> <pre class="tab">
>  E                B                A  E/g# F#m E
> |--------7-------|--------2-------|----------------|
> |9-----9---9-----|4-----4---4-----|2---0-----------|
> |9---9-------9---|4---4-------4---|2---1---2---1---|
> |9---------------|4---------------|2---2---4---2---|
> |7---------------|2---------------|0-------4---2---|
> |----------------|----------------|----4---2---0---|
> </pre>
>
> <pre class="verse">
> E      G#
> Hazel, dirty-blonde hair
> A                           F#7
> I wouldn't be ashamed to be seen with you anywhere.
> E                   G#          C#m   E/b  A
> You got something I want plenty of
> E             B             A   G#m  F#m  E
> Ooh, a little touch of your love.
>
> Hazel, stardust in your eye
> You're goin' somewhere and so am I.
> I'd give you the sky high above
> Ooh, for a little touch of your love.
> </pre>
>
> <pre class="bridge">
> G#                        C#m
> Oh no, I don't need any reminder
> G#                        C#m
> To know how much I really care
> F#
> But it's just making me blinder and blinder
>            B       A        G#m              F#m
> Because I'm up on a hill and still you're not there.
> </pre>
>
> <pre class="verse">
> Hazel, you called and I came,
> Now don't make me play this waiting game.
> You've got something I want plenty of
> Ooh, a little touch of your love.
> </pre>
>
> <hr />
>
> <h2 class="songversion">Version with capo on 4th fret</h2>
> <pre class="verse">
> C      E
> Hazel, dirty-blonde hair
> F                           D7
> I wouldn't be ashamed to be seen with you anywhere.
> C                   E          Am   /g  F
> You got something I want plenty of
> C             G             F   Em  Dm  C
> Ooh, a little touch of your love.
>
> ...
> </pre>
>
> <pre class="bridge">
> E                         Am
> Oh no, I don't need any reminder
> E                         Am
> To know how much I really care
>
> But it's just making me blinder and blinder
>            G       F        Em               Dm
> Because I'm up on a hill and still you're not there.
> </pre>
> </body></html>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
