[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591873#action_12591873
 ] 

Tricia Williams commented on SOLR-380:
--------------------------------------

After a lengthy absence I've returned to this issue with a bit of a new 
perspective.  I recognize that what we have described is really a customization 
of Solr (albeit one I have seen in at least two organizations) and as such 
should be built as a plug-in (http://wiki.apache.org/solr/SolrPlugins) which 
can reside in your solr.home lib directory.  Now that Solr has Lucene 2.3 and 
payloads, my solution is much easier to apply than before.

I'll try to explain it here and then attach the src, deployable jar, and 
example for your use/reuse.

I assume that your structured document can be represented as XML:

{code:xml}
<book title="One, Two, Three">
   <page label="1">one</page>
   <page label="2">two</page>
   <page label="3">three</page>
</book>
{code}
 
But we don't have a tokenizer that can make sense of XML, so I wrote a 
tokenizer which parallels the existing WhitespaceTokenizer, called 
XmlPayloadWhitespaceTokenizer.  XmlPayloadWhitespaceTokenizer extends 
XmlPayloadCharTokenizer, which does the same thing as CharTokenizer in Lucene 
but expects the content to be wrapped in XML tags.  The tokenizer keeps track 
of the XPath associated with each token and stores it as a payload.  
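To illustrate the idea (this is a conceptual Python sketch, not the actual Java tokenizer): walk the structured document, whitespace-tokenize the text in each element, and attach that element's XPath (with attribute predicates) to each token as its payload.

```python
# Conceptual sketch of XmlPayloadWhitespaceTokenizer's behavior:
# emit (token, xpath-payload) pairs from a structured XML document.
import xml.etree.ElementTree as ET

def xpath_for(elem, parent_path):
    # Build a predicate like [title='...'] from each attribute.
    preds = "".join("[%s='%s']" % (k, v) for k, v in elem.attrib.items())
    return "%s/%s%s" % (parent_path, elem.tag, preds)

def tokenize_with_payloads(xml_text):
    tokens = []
    def walk(elem, parent_path):
        path = xpath_for(elem, parent_path)
        if elem.text:
            for word in elem.text.split():  # whitespace tokenization
                tokens.append((word, path))
        for child in elem:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return tokens

doc = """<book title="One, Two, Three">
   <page label="1">one</page>
   <page label="2">two</page>
   <page label="3">three</page>
</book>"""
```

Running `tokenize_with_payloads(doc)` yields pairs such as `("one", "/book[title='One, Two, Three']/page[label='1']")`, matching the payloads shown in the analysis table below.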

To use my tokenizer in Solr I add the deployable jar I created containing 
XmlPayloadWhitespaceTokenizer to my solr.home lib directory and add a 
structured text field type "text_st" to my schema.xml:
{code:xml}
<!-- A text field that uses the XmlPayloadWhitespaceTokenizer to store xpath 
info about the structured document -->
  <fieldType name="text_st" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.XmlPayloadWhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" 
ignoreCase="true" expand="false"/>
      -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" 
splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" 
protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" 
protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
{code}

I also add a field "fulltext_st" of type "text_st".

We can visualize what happens to the input text above using the Solr Admin 
web-app analysis.jsp modified by 
[SOLR-522|https://issues.apache.org/jira/browse/SOLR-522].

|term position|1|2|3|
|term text|one|two|three|
|term type|word|word|word|
|source start,end|3,6|7,10|11,16|
|payload|/book[title='One, Two, Three']/page[label='1']|/book[title='One, Two, 
Three']/page[label='2']|/book[title='One, Two, Three']/page[label='3']|

~Note that I've removed the hex representation of the payload for clarity~

The other side of this problem is how to present the results in a meaningful 
way.  Taking FacetComponent and HighlightComponent as my muse, I created a 
pluggable [SearchComponent|http://wiki.apache.org/solr/SearchComponent] called 
PayloadComponent.  This component recognizes two parameters: "payload" and 
"payload.fl".  If payload=true, the component finds the terms from your query 
in the payload.fl field, retrieves the payloads stored in those tokens, and 
recombines this information to report the XPath of each search result within a 
given document and the number of times the term occurs at that XPath.  
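The aggregation the component performs can be sketched like this (illustrative Python, not the actual Java implementation; the input shape is an assumption for the example):

```python
# Conceptual sketch of what PayloadComponent computes: for each matching
# document, count how often the matched terms occur at each xpath payload,
# producing the nested "payload_context" structure shown below.
from collections import Counter

def payload_context(hits):
    # hits: {doc_id: {field: [(term, xpath_payload), ...]}}
    result = {}
    for doc_id, fields in hits.items():
        result[doc_id] = {
            field: dict(Counter(xpath for _term, xpath in pairs))
            for field, pairs in fields.items()
        }
    return result

hits = {
    "Book.IA.0002": {
        "fulltext_st": [
            ("came", "/book[title='Jack and Jill and Old Dame Gill']/page[id='2']"),
            ("came", "/book[title='Jack and Jill and Old Dame Gill']/page[id='4']"),
        ]
    }
}
```

Here `payload_context(hits)` maps each xpath to an occurrence count of 1, mirroring the `<int name="...">1</int>` entries in the response example further down.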

Again, to use my SearchComponent in Solr I add the deployable jar I created 
containing PayloadComponent to my solr.home lib directory and add a search 
component "payload" to my solrconfig.xml:

{code:xml}
<searchComponent name="payload" 
class="org.apache.solr.handler.component.PayloadComponent"/>
 
  <requestHandler name="/search" 
class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>payload</str>
    </arr>
  </requestHandler>
{code}

Then the result of 
http://localhost:8983/solr/search?q=came&payload=true&payload.fl=fulltext_st 
includes something like this:

{code:xml}
<lst name="payload">
 <lst name="payload_context">
  <lst name="Book.IA.0001">
   <lst name="fulltext_st">
    <int name="/book[title='Crooked 
Man'][url='http://ia310931.us.archive.org//load_djvu_applet.cgi?file=0/items/crookedmanotherr00newyiala/crookedmanotherr00newyiala.djvu'][author='unknown']/page[id='3']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.37729">
   <lst name="fulltext_st">
    <int name="/book[title='Charles Dicken's A Christmas 
Carol'][url=''][author='Dickens, Charles']/stave[title='Marley's 
Ghost'][id='One']/page[id='13']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.0002">
   <lst name="fulltext_st">
    <int name="/book[title='Jack and Jill and Old Dame 
Gill']/page[id='2']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame 
Gill']/page[id='4']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame 
Gill']/page[id='6']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame 
Gill']/page[id='7']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame 
Gill']/page[id='13']">1</int>
   </lst>
  </lst>
 </lst>
</lst>
{code}  

~The documents here are borrowed from the [Internet Archive|http://archive.org] 
and can be found in the xmlpayload-example.zip attached to this issue~

Then you have everything you need to write an XSL transform which will take 
your normal Solr results and supplement them with context from your structured 
document.

There may be some issues with filters that aren't payload-aware.  The only one 
that has concerned me so far is the WordDelimiterFilter.  You can find a quick 
and easy patch at 
[SOLR-532|https://issues.apache.org/jira/browse/SOLR-532].

The other thing you might run into if you use curl or post.jar is that 
XmlUpdateRequestHandler is strict about well-formed XML, and throws an 
exception if it finds anything but the expected <doc> and <field> tags.  To 
work around this, either escape your structured document's XML like this:
{code:xml}
<add>
 <doc>
  <field name="id">0001</field>
  <field name="title">One, Two, Three</field>
  <field name="fulltext_st">
   &lt;book title="One, Two, Three"&gt;
    &lt;page label="1"&gt;one&lt;/page&gt;
    &lt;page label="2"&gt;two&lt;/page&gt;
    &lt;page label="3"&gt;three&lt;/page&gt;
   &lt;/book&gt;
  </field>
 </doc>
</add>
{code}
or hack XmlUpdateRequestHandler to accept your "unexpected XML tag doc/".
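The escaped field body above is just standard XML entity escaping, so any XML library can produce it; for example, with Python's standard library (the field name and content are taken from the example above):

```python
# Produce the escaped <field> body for the structured document, as in
# the <add> example above, using standard XML entity escaping.
from xml.sax.saxutils import escape

structured = """<book title="One, Two, Three">
 <page label="1">one</page>
</book>"""

# escape() converts & < > to entities; quotes inside element text are fine.
field = '<field name="fulltext_st">%s</field>' % escape(structured)
```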

Cool?

> There's no way to convert search results into page-level hits of a 
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph 
> in Solr, there's no way to convert search results into page-level hits. The 
> solution: have a "paged-text" fieldtype which keeps track of page divisions 
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed 
> the tokens (using its standard tokenizers and filters), it would concurrently 
> build a structural map of the item, indicating which term position marked the 
> beginning of which page: <page id="234" firstterm="14324"/>. This map would 
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are 
> returned in the current request, and use the stored map to determine page ids 
> for each term position. The results would imitate the results for 
> highlighting, something like:
> <lst name="pages">
> &nbsp;&nbsp;<lst name="doc1">
> &nbsp;&nbsp;&nbsp;&nbsp;                <int name="pageid">234</int>
> &nbsp;&nbsp;&nbsp;&nbsp;                <int name="pageid">236</int>
> &nbsp;&nbsp;        </lst>
> &nbsp;&nbsp;        <lst name="doc2">
> &nbsp;&nbsp;&nbsp;&nbsp;                <int name="pageid">19</int>
> &nbsp;&nbsp;        </lst>
> </lst>
> <lst name="hitpos">
> &nbsp;&nbsp;        <lst name="doc1">
> &nbsp;&nbsp;&nbsp;&nbsp;                <lst name="234">
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;                        <int 
> name="pos">14325</int>
> &nbsp;&nbsp;&nbsp;&nbsp;                </lst>
> &nbsp;&nbsp;        </lst>
> &nbsp;&nbsp;        ...
> </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
