[jira] [Updated] (SOLR-2597) XmlCharFilter

2011-06-16 Thread Mike Sokolov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated SOLR-2597:
---

Attachment: SOLR-2597.patch

The updated patch addresses (most of) Robert's and Hoss's comments (thanks for 
the speedy review!):

The test now uses the random from the test framework.

I added a test for the factory (actually all the tests now use the factory 
since it is now used to create the parser), but I haven't plumbed this all the 
way through to a schema declaration. 

Moved to org.apache.solr.analysis: I don't know if this is the right place for 
this, but it should at least resolve any jar and Java 1.6 dependency problems, 
I think. I can compile and run the tests from both Eclipse and the ant command 
line, although I'm not sure exactly what that proves.

The parser is now created in the factory rather than being maintained as a 
static in the reader class.
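
To illustrate the change described above, here is a minimal sketch of a factory object that owns the parser configuration as an instance field, so nothing is kept in a static. It is not the code from the patch: the class and method names (XmlParserFactorySketch, createParser, rootElementName) are hypothetical, and it uses the JDK's javax.xml.stream API rather than any Woodstox-specific classes.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: the factory instance owns the StAX input factory,
// so no parser state lives in a static field of the reader class.
final class XmlParserFactorySketch {
    private final XMLInputFactory inputFactory;

    XmlParserFactorySketch() {
        inputFactory = XMLInputFactory.newInstance();
        // Merge adjacent text events so callers see one CHARACTERS event per run.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
    }

    // Creates a fresh parser for each input; parsers are never shared.
    XMLStreamReader createParser(Reader input) {
        try {
            return inputFactory.createXMLStreamReader(input);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not create XML parser", e);
        }
    }

    // Small helper used only to show the parser working: returns the local
    // name of the first element, or null if the document has no element.
    String rootElementName(String xml) {
        XMLStreamReader parser = createParser(new StringReader(xml));
        try {
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                    return parser.getLocalName();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        } finally {
            try {
                parser.close();
            } catch (XMLStreamException ignored) {
                // Nothing useful to do if close fails in this sketch.
            }
        }
    }
}
```

Each factory instance can be configured independently, which is the main benefit over a shared static: two filters with different settings do not interfere with each other.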

 XmlCharFilter
 -

 Key: SOLR-2597
 URL: https://issues.apache.org/jira/browse/SOLR-2597
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Mike Sokolov
 Attachments: SOLR-2597.patch, SOLR-2597.patch


 This CharFilter processes incoming XML using the Woodstox parser, stripping 
 all non-text content and remembering offsets, just like HTMLCharFilter, but 
 respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter 
 also provides the ability to exclude (and include) the content of certain 
 named elements.
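
As a rough illustration of the stripping and element exclusion described above, the following sketch pulls the text content out of an XML string with a StAX parser and skips everything inside a set of excluded elements. It is only a sketch under assumptions, not the patch itself: the class and method names are hypothetical, it uses whatever StAX implementation the JDK provides rather than Woodstox specifically, and it does not track offsets.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: keep only text content, dropping markup and the
// content of any element whose local name is in the excluded set.
final class XmlTextSketch {
    private XmlTextSketch() {}

    static String extractText(String xml, Set<String> excluded) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        StringBuilder out = new StringBuilder();
        int excludeDepth = 0; // > 0 while inside an excluded element
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            // Count nesting so children of an excluded element stay excluded.
                            if (excludeDepth > 0 || excluded.contains(reader.getLocalName())) {
                                excludeDepth++;
                            }
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            if (excludeDepth > 0) {
                                excludeDepth--;
                            }
                            break;
                        case XMLStreamConstants.CHARACTERS:
                        case XMLStreamConstants.CDATA:
                        case XMLStreamConstants.SPACE:
                            if (excludeDepth == 0) {
                                out.append(reader.getText());
                            }
                            break;
                        default:
                            break; // comments, processing instructions, etc. are dropped
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return out.toString();
    }
}
```

For example, calling extractText on `<doc><title>A &lt; B</title><note>skip</note> end</doc>` with "note" excluded keeps the title text (with the entity resolved) and the trailing text, and drops the note.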
 In order to compute character offsets properly when mixed line termination 
 styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, 
 &amp;) are present, we require a newer version of Woodstox (4.1.1) than is 
 currently in solr/lib.  The earlier versions of the parser could not report 
 these entity events, so we couldn't tell the difference between < and 
 &lt; and the offsets could be wrong.  The upgraded version is in the patch.
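
To show why the entity events matter for offsets, here is a small sketch of offset correction by cumulative difference, in the same spirit as the bookkeeping Lucene char filters do. The class and method names are hypothetical and this is not the code from the patch. The idea: an entity such as &lt; occupies four characters in the input but only one in the filtered text, so every later offset is shifted by three, and the filter can only record that shift if the parser reports where the entity occurred.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: maps offsets in the filtered text back to offsets
// in the original XML input.
final class OffsetCorrectionSketch {
    // key:   offset in the filtered text from which the correction applies
    // value: total number of input characters dropped before that offset
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    void addCorrection(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    // Returns the offset in the original input for an offset in the filtered text.
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> entry = corrections.floorEntry(outputOffset);
        return entry == null ? outputOffset : outputOffset + entry.getValue();
    }
}
```

Take the input `<p>a &lt; b</p>`, whose filtered text is `a < b`. The opening tag drops 3 characters before filtered offset 0, and the entity drops 3 more from filtered offset 3 onward, so the corrections are (0, 3) and (3, 6). The final "b" sits at offset 4 in the filtered text and at offset 10 in the input. If the parser reported the entity as a plain "<", the second correction would be missing and "b" would be mapped to offset 7.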

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2597) XmlCharFilter

2011-06-15 Thread Mike Sokolov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated SOLR-2597:
---

Attachment: SOLR-2597.patch

I tried to include the upgraded Woodstox jars, but I don't think I figured out 
how to put binaries in the patch.  The jars needed are: 
http://repository.codehaus.org/org/codehaus/woodstox/woodstox-core-asl/4.1.1/woodstox-core-asl-4.1.1.jar
 and 
http://repository.codehaus.org/org/codehaus/woodstox/stax2-api/3.1.1/stax2-api-3.1.1.jar
which replace the existing wstx-asl-xxx.jar. 

 XmlCharFilter
 -

 Key: SOLR-2597
 URL: https://issues.apache.org/jira/browse/SOLR-2597
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Mike Sokolov
 Attachments: SOLR-2597.patch


