Hi All,

The Structured (or Multi-Page, Multi-Part) document problem is one I've
been thinking about for a while.  A couple of years ago, when the project
I was working on used Lucene only (no Solr), we solved this problem in
several steps.  At ingestion time we created a custom analyzer and
surrounding Java code that built a mapping from token positions to the
page each token appears on (recall that analyzers tokenize the terms in a
given field and record each token's position).  This mapping was stored
outside of the Lucene index.  At query time, we used home-built Java code
to pull the position hits matching the query from the index and augmented
the results generated by Lucene.  At presentation time the results were
molded into XML and then transformed by several XSL stylesheets, one of
which translated the position hits into the pages they were on, using the
information gleaned at the ingestion stage.
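
For what it's worth, the core of that ingestion step can be sketched
roughly like this.  This is a minimal sketch, not our actual code: it
assumes a pre-tokenized stream containing inline <page id="..."/>
milestones, and the class and method names are made up for illustration:

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: build a map from term position to page id while
// walking a token stream that contains <page id="..."/> milestones.
// Milestones do not consume term positions; only real terms do.
public class PageMapBuilder {
    public static TreeMap<Integer, String> buildMap(List<String> tokens) {
        TreeMap<Integer, String> firstTermOfPage = new TreeMap<>();
        int position = 0;
        for (String tok : tokens) {
            if (tok.startsWith("<page")) {
                // Milestone: the next term position starts this page.
                String id = tok.replaceAll(".*id=\"([^\"]+)\".*", "$1");
                firstTermOfPage.put(position, id);
            } else {
                position++;  // a real term consumes one position
            }
        }
        return firstTermOfPage;
    }
}
```

The resulting map (firstterm position -> page id) is what we serialized
and stored outside the index.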

When we moved to Solr, we created a custom QueryResponseWriter to get the
term positions into the XML results, and kept the same transformation to
obtain the page-level hits.  The ingestion stage stayed the same -- so
really we're using Lucene to build the index, but Solr sits on top of it
to serve results.

I admit this is an awkward hack.  Peter Binkley ([EMAIL PROTECTED]),
with whom I worked on the project, made this suggested improvement:



> 
> "Paged-Text" FieldType for Solr
> 
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
> 
> The input would contain page milestones: <page id="234"/>. As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page: <page id="234"
> firstterm="14324"/>. This map would be stored in an unindexed field in
> some efficient format.
> 
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
> 
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>
> 
> We have some code that does something like this in a Lucene context, which
> could form the basis for a Solr fieldtype; but it would probably be just
> as easy to start fresh.
> 
> 
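
The query-time step in the proposal above -- determining page ids from
term positions via the stored map -- boils down to a floor lookup on a
sorted map: the page containing a hit is the milestone with the greatest
firstterm value that is <= the hit's term position.  A minimal sketch
(class and method names are my own, and the numbers reuse the example
from the quoted message):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the query-time lookup: resolve a hit's term
// position to a page id using the stored firstterm -> pageid map.
public class PageLookup {
    public static String pageForPosition(TreeMap<Integer, String> firstTermToPage,
                                         int hitPos) {
        // Greatest firstterm <= hitPos identifies the containing page.
        Map.Entry<Integer, String> e = firstTermToPage.floorEntry(hitPos);
        return e == null ? null : e.getValue();
    }
}
```

With <page id="234" firstterm="14324"/> in the map, a hit at position
14325 resolves to page 234, matching the sample results above.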

My current project would also like to include some metadata about each
sub-part of the document.  For example, each page could have a URL and/or
a title associated with its content.  This becomes meaningful when we
index things like newspapers and monographs, which may have page-,
chapter-, or section-level content.  A solution would ideally take this
into consideration.
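
One way to fold that metadata into the proposal's "stored in an unindexed
field in some efficient format" idea is to serialize the page map, with
per-page url and title attached, into a single stored-field string and
parse it back at query time.  A sketch, with made-up delimiters and field
layout (this is an assumption for illustration, not an existing Solr
format):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical codec for the stored page map. Each line encodes one page:
// firstterm|pageId|url|title. The value array holds {pageId, url, title}.
public class PageMapCodec {
    public static String encode(TreeMap<Integer, String[]> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, String[]> e : map.entrySet()) {
            String[] v = e.getValue();  // {pageId, url, title}
            sb.append(e.getKey()).append('|')
              .append(v[0]).append('|').append(v[1]).append('|').append(v[2])
              .append('\n');
        }
        return sb.toString();
    }

    public static TreeMap<Integer, String[]> decode(String stored) {
        TreeMap<Integer, String[]> map = new TreeMap<>();
        for (String line : stored.split("\n")) {
            if (line.isEmpty()) continue;
            // Limit the split so '|' inside the title survives.
            String[] f = line.split("\\|", 4);
            map.put(Integer.parseInt(f[0]), new String[] { f[1], f[2], f[3] });
        }
        return map;
    }
}
```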
 
Does anyone with more experience know if this is a reasonable approach? 
Does an issue exist for this feature request?  Other comments or questions?

Thanks,
Tricia


Pierre-Yves LANDRON wrote:
> 
> Hello,
> 
> Is it possible to structure Lucene documents via Solr, so one document
> could fit into another one?  What I would like to do, for example: I
> want to retrieve full-text articles, each of which spans several pages.
> Results must take into account both the pages and the article the search
> terms come from.  I can create a Lucene document for each page of the
> article AND for the article itself, and do two requests to get my
> results, but that would duplicate the full text in the index and would
> not be very efficient.  Ideally, what I would like to do is create a
> document indexing the text of each page of the article, and group these
> documents in one document that describes the article: this way, when
> Lucene retrieves a requested term, I'll get both the article and the
> page that contains the term.  I wonder if there's a way to emulate this
> behavior elegantly with Solr?
> 
> Kind Regards,
> Pierre-Yves Landron
> 

-- 
View this message in context: 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a13185053
Sent from the Solr - User mailing list archive at Nabble.com.