[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535768 ]
Mike Klaas commented on SOLR-380:
---------------------------------

In my opinion the best solution is to create one solr document per page and denormalize the container data across each page.

If I had to implement it the other way, I would probably index the pages as a multivalued field with a large position increment gap (say 1000), store term vectors, and use the position information from the term vectors to determine the page hits (e.g., pos 4668 -> page 5; pos 668 -> page 1; pos 9999 -> page 10). Assumes < 1000 tokens per page, of course.

Incidentally, this discussion doesn't really belong here. It would be better to sketch out ideas on solr-user, then move to JIRA to track a resulting patch (if it gets that far). I actually don't think that there is anything to add to Solr here--it seems more of a question of customization.

> There's no way to convert search results into page-level hits of a
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph
> in Solr, there's no way to convert search results into page-level hits. The
> solution: have a "paged-text" fieldtype which keeps track of page divisions
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed
> the tokens (using its standard tokenizers and filters), it would concurrently
> build a structural map of the item, indicating which term position marked the
> beginning of which page: <page id="234" firstterm="14324"/>. This map would
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page ids
> for each term position. The results would imitate the results for
> highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
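
The two position-to-page lookups discussed in this thread can be sketched roughly as follows. This is only an illustration in plain Python, not Solr code. The function names and the page-map layout are hypothetical. The first function follows the arithmetic in the comment and assumes that page n's tokens occupy positions (n-1)*gap through n*gap-1; in a real index the gap may be added on top of each page's token count, so positions might not line up this neatly. The second function assumes the stored map from the issue description has been loaded as a list of (firstterm, page id) pairs sorted by firstterm. Only the entry for page 234 comes from the issue; the other entries are made-up sample data.

```python
from bisect import bisect_right

GAP = 1000  # position increment gap suggested in the comment ("say 1000")


def page_from_gap(pos, gap=GAP):
    """Map a term position to a 1-based page number.

    Assumption: page n's tokens sit at positions (n-1)*gap .. n*gap-1,
    which also requires fewer than `gap` tokens per page.
    """
    return pos // gap + 1


def page_from_map(pos, page_map):
    """Look up the page id for a term position in a stored page map.

    page_map is a hypothetical in-memory form of the <page .../> entries:
    a list of (firstterm, page_id) tuples sorted by firstterm.
    """
    starts = [first for first, _ in page_map]
    # Index of the last page whose first term position is <= pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError("position %d precedes the first page" % pos)
    return page_map[i][1]


# Sample map: (14324, 234) is from the issue; the neighbours are invented.
sample_map = [(13870, 233), (14324, 234), (14790, 235)]

if __name__ == "__main__":
    for pos in (4668, 668, 9999):
        print(pos, "->", page_from_gap(pos))
    print(14325, "->", page_from_map(14325, sample_map))
```

The gap version needs no stored map but depends on the per-page token limit; the map version handles pages of any length at the cost of storing and loading the extra field.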