[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Laurent Hoss (JIRA) Fri, 16 Jan 2009 09:39:31 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664579#action_12664579
 ]


Laurent Hoss commented on SOLR-380:
-----------------------------------

Hi Tricia,
This looks nice; I've been searching for such a feature in Lucene (and Solr) 
for years!
But before getting too excited, I'd better ask the right questions before 
running a real test, as we don't even use Solr yet (though I really want 
to :) 

In fact we currently have our own home-grown solution for a similar problem:
In our case we want to restrict boolean searches to paragraphs or sentences of 
a document, and implemented this (like many others) by indexing extra docs for 
paragraphs etc. (duplicating many metadata fields of the parent document).
Besides multiplying the index size, mapping the found paragraphs back to their 
base documents involved a lot of custom coding, and only recently did we 
implement at least a fast count of the base docs for the found paragraph 
docs, using a 'baseDocId' FieldCache (essentially a 'group by' in SQL lingo).

This leads to the following requirements and questions:
* What is the performance of your PayloadComponent compared to the standard 
SearchHandler?
We especially need very fast count(*) functionality, to dynamically compute 
statistics/charts that need hundreds of queries.
For this we just need the hit count of documents/paragraphs, without the xpath 
payload info, which would generate a really big XML response for a result set 
of 100K docs!

* We want to find only documents where a (boolean) query matches within one of 
the paragraph_* fields, and not if the query matches over the combined content 
of multiple paragraphs, as discussed here:
http://www.nabble.com/Redundant-indexing-*-4-only-solution-(for-par-sen-and-case-sensitivity)-td13684315.html#a13685041
and
http://www.nabble.com/What-is-the-best-way-to-index-xml-data-preserving-the-mark-up--td13641104.html#a13657470
> The problem is that a search for sentence:foo AND sentence:bar is matching if 
> foo matches in any sentence of the paragraph, and bar also matches in any 
> sentence of the paragraph. 
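
To make the difference between the two matching behaviours concrete, here is a 
small Python sketch. It is not Lucene or Solr code; it assumes naive whitespace 
tokenization and plain AND semantics, and the helper names (tokens, 
matches_combined, matches_within_one) are hypothetical.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def matches_combined(paragraphs: List[str], terms: List[str]) -> bool:
    """AND over the combined content: terms may come from different paragraphs."""
    all_tokens: Set[str] = set()
    for p in paragraphs:
        all_tokens |= tokens(p)
    return all(t in all_tokens for t in terms)


def matches_within_one(paragraphs: List[str], terms: List[str]) -> bool:
    """AND restricted to a single paragraph: all terms must co-occur in one."""
    for p in paragraphs:
        p_tokens = tokens(p)
        if all(t in p_tokens for t in terms):
            return True
    return False


doc = ["foo appears here", "bar appears there"]
print(matches_combined(doc, ["foo", "bar"]))    # the unwanted match
print(matches_within_one(doc, ["foo", "bar"]))  # the behaviour we want
```

The second function is the behaviour asked for above: the document only counts 
as a hit when a single paragraph satisfies the whole query.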


Do you think this is a good option for us?
PS: We should probably put up a wiki page on this topic, as I've seen at 
least 10 people asking about possible solutions (OK, often with 
slightly different requirements)!

A completely different way of solving this would be to use the SpanQuery 
package together with the nice-looking Qsol (http://myhardshadow.com/about.php), 
although I'm not sure about its performance, especially with (really) long 
boolean queries!


> There's no way to convert search results into page-level hits of a 
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, 
> xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph 
> in Solr, there's no way to convert search results into page-level hits. The 
> solution: have a "paged-text" fieldtype which keeps track of page divisions 
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed 
> the tokens (using its standard tokenizers and filters), it would concurrently 
> build a structural map of the item, indicating which term position marked the 
> beginning of which page: <page id="234" firstterm="14324"/>. This map would 
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are 
> returned in the current request, and use the stored map to determine page ids 
> for each term position. The results would imitate the results for 
> highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>
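
The search-time lookup described in the issue (term position to page id via the 
stored map) could be sketched like this. It is a minimal Python illustration, 
not the code in the attached patch; it assumes the map is a list of 
(firstterm, page id) pairs sorted by firstterm. The function name 
page_for_position and the entries for pages 233 and 235 are made up for the 
example; only page 234 with firstterm 14324 and position 14325 come from the 
description.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def page_for_position(page_map: List[Tuple[int, str]], pos: int) -> Optional[str]:
    """Return the id of the page whose term range contains position pos."""
    # page_map is sorted by first term position: [(firstterm, page_id), ...]
    first_terms = [first for first, _ in page_map]
    # Index of the last page that starts at or before pos.
    i = bisect_right(first_terms, pos) - 1
    return page_map[i][1] if i >= 0 else None


# Hypothetical structural map; only the middle entry is taken from the issue.
page_map = [(14000, "233"), (14324, "234"), (14900, "235")]
print(page_for_position(page_map, 14325))
```

A binary search keeps each lookup logarithmic in the number of pages, which 
matters if many hit positions have to be resolved per request.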

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
