Hi All,

The structured (or multi-page, multi-part) document problem is one I've been thinking about for a while. A couple of years ago, when the project I was working on used Lucene only (no Solr), we solved it in several steps. At ingestion, we wrote a custom analyzer and surrounding Java code that built a mapping from token positions to the page each position falls on (recall that analyzers tokenize the terms in a given field and record the position of each token). This mapping was stored outside the Lucene index. At query time, we used home-built Java code to pull the position hits matching the query from the index and augmented the results generated by Lucene. At presentation time, the results were molded into XML and then transformed by several XSL stylesheets, one of which translated the position hits to the page they were on, using the information gleaned from the ingestion stage.
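
To make the ingestion step concrete, here is a minimal sketch of building a position-to-page map. It is not the project's code: the class name PagePositionMap is hypothetical, and it assumes whitespace tokenization with 0-based positions and one position per token. A real Lucene analyzer may assign positions differently (for example, leaving gaps where stop words were removed), so the map would have to be built from the positions the analyzer actually emits.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: records, for each page, the position of its first token.
final class PagePositionMap {

    /**
     * Builds a map from "position of the first token on a page" to "page id".
     * The input must iterate in page order (e.g. a LinkedHashMap of id -> text).
     * Assumption: tokens are split on whitespace and each takes one position.
     */
    static TreeMap<Integer, String> build(Map<String, String> pagesInOrder) {
        TreeMap<Integer, String> firstPositionToPage = new TreeMap<>();
        int position = 0; // running token position across the whole document
        for (Map.Entry<String, String> page : pagesInOrder.entrySet()) {
            String text = page.getValue().trim();
            if (text.isEmpty()) {
                continue; // a blank page contributes no tokens and gets no entry
            }
            firstPositionToPage.put(position, page.getKey());
            position += text.split("\\s+").length;
        }
        return firstPositionToPage;
    }
}
```

With pages "1" = "alpha beta gamma" and "3" = "delta epsilon", the map holds 0 -> "1" and 3 -> "3". Keeping the map sorted by first position means the page for any hit position can later be found with a floor lookup.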

When we moved to Solr, we created a custom QueryResponseWriter to get the position locations into the XML results, and kept the same transformation to obtain the page-level hits. The ingestion stage stays the same, so really we're using Lucene to build the index, with Solr sitting on top of it to serve results. I admit this is an awkward hack. Peter Binkley ([EMAIL PROTECTED]), who I worked with on the project, suggested this improvement:

> "Paged-Text" FieldType for Solr
>
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
>
> The input would contain page milestones: <page id="234"/>. As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page:
> <page id="234" firstterm="14324"/>. This map would be stored in an
> unindexed field in some efficient format.
>
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
>
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>
>
> We have some code that does something like this in a Lucene context, which
> could form the basis for a Solr fieldtype; but it would probably be just
> as easy to start fresh.
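
As a rough illustration of the search-time step in Peter's proposal, the sketch below resolves term positions to page ids using a sorted map of firstterm values. It is plain Java with no Solr or Lucene calls. The class PageHitResolver and its methods are hypothetical names, and the firstterm values for pages 235 and 236 in the usage note are invented for the example. Only page 234 with firstterm 14324 and the hit at position 14325 come from the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: maps hit positions to the pages that contain them.
final class PageHitResolver {

    // firstterm (position of the first term on the page) -> page id
    private final TreeMap<Integer, Integer> firstTermToPageId = new TreeMap<>();

    /** Registers one entry of the structural map, e.g. page 234, firstterm 14324. */
    void addPage(int pageId, int firstTerm) {
        firstTermToPageId.put(firstTerm, pageId);
    }

    /** Returns the id of the page containing the position, or -1 if none does. */
    int pageFor(int position) {
        // The containing page is the one with the greatest firstterm <= position.
        Map.Entry<Integer, Integer> entry = firstTermToPageId.floorEntry(position);
        return entry == null ? -1 : entry.getValue();
    }

    /** Groups hit positions by page id, sorted by page id. */
    Map<Integer, List<Integer>> groupByPage(List<Integer> hitPositions) {
        Map<Integer, List<Integer>> byPage = new TreeMap<>();
        for (int position : hitPositions) {
            int pageId = pageFor(position);
            if (pageId < 0) {
                continue; // position precedes the first page milestone
            }
            byPage.computeIfAbsent(pageId, k -> new ArrayList<>()).add(position);
        }
        return byPage;
    }
}
```

For example, after addPage(234, 14324), addPage(235, 14800) and addPage(236, 15200), pageFor(14325) returns 234. The grouped map has the same shape as the "hitpos" list in the proposed response, and its key set corresponds to the "pages" list.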

My current project would also like to include some metadata about each sub-part of the document. For example, each page would have a URL and/or a title associated with its content. This becomes meaningful when we index things like newspapers and monographs, which may have page-, chapter-, or section-level content. So a solution would ideally take this into consideration.

Does anyone with more experience know if this is a reasonable approach? Does an issue exist for this feature request? Other comments or questions?

Thanks,
Tricia

Pierre-Yves LANDRON wrote:
>
> Hello,
>
> Is it possible to structure Lucene documents via Solr, so one document
> could fit into another one?
>
> What I would like to do, for example: I want to retrieve full-text
> articles that fit on several pages for each of them. Results must take
> into account both the pages and the article from which the search terms
> are from. I can create a Lucene document for each page of the article
> AND the article itself, and do two requests to get my results, but it
> would duplicate the full text in the index, and will not be too
> efficient. Ideally, what I would like to do is to create a document for
> indexing the text of each page of the article, and group these documents
> in one document that describes the article: this way, when Lucene
> retrieves a requested term, I'll get the article and the page that
> contains the term.
>
> I wonder if there's a way to emulate this behavior elegantly with Solr?
>
> Kind Regards,
> Pierre-Yves Landron
>

--
View this message in context: http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a13185053
Sent from the Solr - User mailing list archive at Nabble.com.