[
https://issues.apache.org/jira/browse/SOLR-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673581#action_12673581
]
Karl Wettin commented on SOLR-1020:
-----------------------------------
bq. Karl, would it make sense to use the NamedList format instead of a custom
XML one? That way, you can use most of the existing parsing code.
I don't know, would it?
bq. Thoughts?
The reason I chose JSR173 is that it allows for unmarshalling one token at a
time rather than all at once. I.e. I want to reuse the token instance in the
TokenStream the Analyzer produces rather than unmarshal all of the data at
once. My first thought was to parse the XML using a lexer, but some simple tests
showed that the overhead of JSR173 was very small compared to jflex. I am,
however, considering jflex for the binary format.
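As a rough illustration of the pull-style unmarshalling described above, here is a minimal sketch. It is not the code from the patch: the names PullTokenReader and SimpleToken are hypothetical, SimpleToken merely stands in for a reusable Lucene Token, and only a few of the token elements are read.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: pulls one <token> at a time from the XML via StAX (JSR173). */
final class PullTokenReader {

    /** Hypothetical stand-in for a reusable Lucene Token. */
    static final class SimpleToken {
        String term;
        int positionIncrement;
        int startOffset;
        int endOffset;
    }

    private final XMLStreamReader xml;
    // A single instance that is refilled for every token instead of allocating new ones.
    private final SimpleToken reusable = new SimpleToken();

    PullTokenReader(String source) {
        try {
            xml = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(source));
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Unreadable token XML", e);
        }
    }

    /** Advances to the next <token>; returns the refilled instance, or null when none is left. */
    SimpleToken next() {
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    switch (xml.getLocalName()) {
                        case "token":
                            // Reset the shared instance before filling it again.
                            reusable.term = null;
                            reusable.positionIncrement = 1;
                            reusable.startOffset = 0;
                            reusable.endOffset = 0;
                            break;
                        case "term":
                            reusable.term = xml.getElementText();
                            break;
                        case "positionIncrement":
                            reusable.positionIncrement = Integer.parseInt(xml.getElementText().trim());
                            break;
                        case "startOffset":
                            reusable.startOffset = Integer.parseInt(xml.getElementText().trim());
                            break;
                        case "endOffset":
                            reusable.endOffset = Integer.parseInt(xml.getElementText().trim());
                            break;
                        default:
                            break; // type, flags and payload are ignored in this sketch
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "token".equals(xml.getLocalName())) {
                    return reusable; // stop here; the rest of the document stays unparsed
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed token XML", e);
        }
    }

    public static void main(String[] args) {
        PullTokenReader reader = new PullTokenReader(
            "<tokens><token><term>new</term></token><token><term>york</term></token></tokens>");
        for (SimpleToken t = reader.next(); t != null; t = reader.next()) {
            System.out.println(t.term);
        }
    }
}
```

Each call to next() parses only as far as the closing tag of one token, which is the property that makes a streaming API attractive here compared with building the whole token list up front.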
I came up with this patch because I have a rather elaborate tokenization scheme
using ShingleMatrixFilter. My current solution is to pass a base64-encoded
serialized object as the field value and use a custom Analyzer that produces
the TokenStream. However, the tokenization is rather expensive (especially
during the initial bulk import of my zillions of documents), so I'd rather do this
on my clients, as I've got plenty of those but only one Solr.
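A minimal sketch of what the base64 workaround described above could look like, under stated assumptions: Base64FieldCodec is a hypothetical name, the serialized object is just a list of terms, and java.util.Base64 (Java 8 and later) is used for brevity even though it was not available at the time of this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical sketch: carries a serialized object in a text field as base64. */
final class Base64FieldCodec {

    /** Client side: serialize the value and encode it so it can be sent as a field value. */
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Server side: what a custom Analyzer would do before producing its TokenStream. */
    static Object decode(String fieldValue) {
        byte[] raw = Base64.getDecoder().decode(fieldValue);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> terms = new ArrayList<>(Arrays.asList("new", "york"));
        String fieldValue = encode(terms);
        System.out.println(fieldValue);
        System.out.println(decode(fieldValue));
    }
}
```

Note that the whole object has to be decoded before the first token is available, and that Java deserialization of input from untrusted clients is unsafe; both points weigh in favour of a streamable, explicit format.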
> PreAnalyzed field analyzer
> --------------------------
>
> Key: SOLR-1020
> URL: https://issues.apache.org/jira/browse/SOLR-1020
> Project: Solr
> Issue Type: New Feature
> Components: Analysis
> Affects Versions: 1.3
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: SOLR-1020.txt
>
>
> An Analyzer that produces a TokenStream based on XML input that contains a
> marshalled TokenStream. Also contains a static TokenStream XML marshaller.
> I kind of pulled this out of my pocket without testing it in a real
> environment in order to get some comments on the solution before I add it to
> my project. So consider it a beta patch.
> It uses the JSR173 XMLStream API, available in Java 1.6, compatible with Java 1.5
> and downloadable from https://sjsxp.dev.java.net/
> XSD:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
>            xmlns:xs="http://www.w3.org/2001/XMLSchema">
>   <xs:element name="tokens" type="tokensType"/>
>   <xs:complexType name="tokensType">
>     <xs:sequence>
>       <xs:element type="tokenType" name="token" maxOccurs="unbounded"/>
>     </xs:sequence>
>   </xs:complexType>
>   <xs:complexType name="tokenType">
>     <xs:sequence>
>       <xs:element type="xs:int" name="positionIncrement" maxOccurs="1"/>
>       <xs:element type="xs:string" name="term" minOccurs="1" maxOccurs="1"/>
>       <xs:element type="xs:string" name="type" maxOccurs="1"/>
>       <xs:element type="xs:int" name="startOffset" maxOccurs="1"/>
>       <xs:element type="xs:int" name="endOffset" maxOccurs="1"/>
>       <xs:element type="xs:int" name="flags" maxOccurs="1"/>
>       <xs:element type="payloadType" name="payload" maxOccurs="1"/>
>     </xs:sequence>
>   </xs:complexType>
>   <xs:complexType name="payloadType">
>     <xs:choice maxOccurs="1" minOccurs="1">
>       <xs:element type="bytesType" name="bytes"/>
>       <xs:element type="xs:string" name="hex"/>
>       <xs:element type="xs:string" name="base64"/>
>     </xs:choice>
>   </xs:complexType>
>   <xs:complexType name="bytesType">
>     <xs:sequence>
>       <xs:element type="xs:byte" name="byte" maxOccurs="unbounded" minOccurs="1"/>
>     </xs:sequence>
>   </xs:complexType>
> </xs:schema>
> {code}
> Even though the XSD defines several ways to encode a Payload, only <hex> is
> currently supported.
> Example XML:
> {code:xml}
> <tokens>
>   <token>
>     <positionIncrement>1</positionIncrement>
>     <term>term</term>
>     <type>type</type>
>     <startOffset>0</startOffset>
>     <endOffset>3</endOffset>
>     <flags>65535</flags>
>     <payload><hex>fffefd</hex></payload>
>   </token>
> </tokens>
> {code}
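For the client side, the quoted format could be produced with the writing half of the same JSR173 API. The following is only a sketch and not the marshaller from the attached patch: TokenXmlWriter and its method signatures are hypothetical, it writes a single token, and it emits the elements in the order the XSD sequence lists them, with the payload as <hex>.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical sketch: marshals one token into the <tokens> XML format. */
final class TokenXmlWriter {

    /** Encodes payload bytes as lowercase hex, two digits per byte. */
    static String toHex(byte[] payload) {
        StringBuilder hex = new StringBuilder(payload.length * 2);
        for (byte b : payload) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    static String marshal(String term, String type, int positionIncrement,
                          int startOffset, int endOffset, int flags, byte[] payload) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("tokens");
            xml.writeStartElement("token");
            // Element order follows the xs:sequence of tokenType.
            writeElement(xml, "positionIncrement", Integer.toString(positionIncrement));
            writeElement(xml, "term", term);
            writeElement(xml, "type", type);
            writeElement(xml, "startOffset", Integer.toString(startOffset));
            writeElement(xml, "endOffset", Integer.toString(endOffset));
            writeElement(xml, "flags", Integer.toString(flags));
            xml.writeStartElement("payload");
            writeElement(xml, "hex", toHex(payload));
            xml.writeEndElement(); // payload
            xml.writeEndElement(); // token
            xml.writeEndElement(); // tokens
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write token XML", e);
        }
        return out.toString();
    }

    private static void writeElement(XMLStreamWriter xml, String name, String text)
            throws XMLStreamException {
        xml.writeStartElement(name);
        xml.writeCharacters(text); // writeCharacters escapes &, < and >
        xml.writeEndElement();
    }

    public static void main(String[] args) {
        byte[] payload = {(byte) 0xff, (byte) 0xfe, (byte) 0xfd};
        System.out.println(marshal("term", "type", 1, 0, 3, 65535, payload));
    }
}
```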