If any of you care about your Stack Overflow reputation and have time for it, I have also posted the question there:
http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages
Thanks again to everybody.

On Wed, 2 Mar 2016 at 09:42, Zaccheo Bagnati <zacch...@gmail.com> wrote:

> Thanks Alexandre,
> your solution seems very good: I'll surely try it and let you know. I like
> the idea of mixing block joins and grouping!
>
> On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
>> Here is an - untested - possible approach. I might be missing
>> something by combining these things in too many layers, but...
>>
>> 1) Have chapters as parent documents and pages as children within them.
>> Block index them together.
>> 2) On pages, include the page text (probably not stored) as one field.
>> Also include a second field that has the last paragraph of that page as
>> well as the first paragraph of the next page. This gives you phrase
>> matches across boundaries. Also include pageId, etc.
>> 3) On chapters, include the book id as a string field.
>> 4) Use a block join query to search against pages, but return (parent)
>> chapters:
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>> 5) Use grouping or collapsing+expanding by book id to group chapters
>> within a book:
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>> or
>> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
>> 6) Use the [child] DocumentTransformer to get pages back, with childFilter
>> to re-limit them by your query:
>> https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
>>
>> The main question is whether 6) will be able to piggyback on the
>> output of 5)... And, of course, the performance...
>>
>> I would love to know if this works, even partially. Either on the
>> mailing list or directly.
>>
>> Regards,
>> Alex.
>> ----
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>> On 2 March 2016 at 00:50, Zaccheo Bagnati <zacch...@gmail.com> wrote:
>> > Thank you, Jack, for your answer.
>> > There are 2 reasons:
>> > 1. The requirement is to show both books and chapters, grouped, in the
>> > result list. So I would have to execute the query grouping by book,
>> > retrieve the first, let's say, 10 books (sorted by relevance) and then,
>> > for each book, repeat the query grouping by chapter (again ordered by
>> > relevance) to obtain what we need (unfortunately it is not up to me to
>> > define the requirements... but they do make sense). Unless there is
>> > some Solr feature to do this in a single call (and that would be great!).
>> > 2. Searching on pages will not match phrases that span 2 pages
>> > (e.g. if the last word of page 1 is "broken" and the first word of page 2
>> > is "sentence", searching for "broken sentence" will not match).
>> > However, if we do not find a better solution, I think that your proposal
>> > is not so bad... I hope that reason #2 is negligible and that #1
>> > performs fast enough even though we are multiplying queries.
>> >
>> > On Tue, 1 Mar 2016 at 14:28, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> >
>> >> Any reason not to use the simplest structure - each page is one Solr
>> >> document with a book field, a chapter field, and a page text field?
>> >> You can then use grouping to group results by book (title text) or
>> >> even chapter (title text and/or number). Maybe initially group by book
>> >> and then, if the user selects a book group, you can re-query with the
>> >> specific book and then group by chapter.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati <zacch...@gmail.com> wrote:
>> >>
>> >> > The original data is quite well structured: it comes in XML with
>> >> > chapters and tags that mark the original page breaks of the paper
>> >> > version. This lets us restructure it almost any way we want before
>> >> > creating the Solr index.
>> >> >
>> >> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> >> >
>> >> > > To start, what is the form of your input data - is it already
>> >> > > divided into chapters and pages? Or... are you starting with raw
>> >> > > PDF files?
>> >> > >
>> >> > > -- Jack Krupansky
>> >> > >
>> >> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati <zacch...@gmail.com> wrote:
>> >> > >
>> >> > > > Hi all,
>> >> > > > I'm looking for ideas on how to define the schema and how to
>> >> > > > perform queries in this use case: we have to index books, each
>> >> > > > book is split into chapters, and chapters are split into pages
>> >> > > > (pages represent the original page breaks of the printed
>> >> > > > version). We should show the results grouped by book, then by
>> >> > > > chapter (within the same book) and by page (within the same
>> >> > > > chapter). As far as I know, we have 2 options:
>> >> > > >
>> >> > > > 1. Index pages as Solr documents. This way we could theoretically
>> >> > > > retrieve chapters (and books?) using grouping, but
>> >> > > >    a. we will miss matches across two contiguous pages (page
>> >> > > > breaks are only due to typographical needs, so concepts can be
>> >> > > > split... as in printed books)
>> >> > > >    b. I don't know if it is possible in Solr to group results on
>> >> > > > two different levels (books and chapters)
>> >> > > >
>> >> > > > 2. Index chapters as Solr documents. In this case we will have the
>> >> > > > right matches, but how do we obtain the matching pages? (We need
>> >> > > > pages because the client can only display pages.)
>> >> > > >
>> >> > > > We have been struggling with this problem for a long time and
>> >> > > > have not been able to find a suitable solution, so I'm asking
>> >> > > > whether someone has ideas or has already solved a similar issue.
>> >> > > > Thanks
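
For anyone who wants to try the block join + grouping idea from Alexandre's reply, here is a minimal, untested sketch in Python of what the indexed documents and the request parameters might look like. It only builds plain dictionaries and talks to no Solr server. The field names (doc_type, book_id, page_text, page_boundary_text) and the helper names are assumptions made up for illustration, not something from the thread. Whether the [child] transformer cooperates with grouping is the open question Alexandre raises himself.

```python
def boundary_text(paragraphs, next_paragraphs):
    """Last paragraph of this page joined with the first paragraph of the next."""
    tail = paragraphs[-1] if paragraphs else ""
    head = next_paragraphs[0] if next_paragraphs else ""
    return (tail + " " + head).strip()


def chapter_block(book_id, chapter_id, pages):
    """Build one parent (chapter) document with its pages nested as children.

    `pages` is a list of (page_id, [paragraph, ...]) tuples in reading order.
    Field names are hypothetical; `_childDocuments_` is the key Solr's JSON
    update format uses for nested child documents.
    """
    children = []
    for i, (page_id, paragraphs) in enumerate(pages):
        next_paragraphs = pages[i + 1][1] if i + 1 < len(pages) else []
        children.append({
            "id": page_id,
            "doc_type": "page",
            "page_text": " ".join(paragraphs),
            # Step 2: extra field so phrases spanning a page break can match.
            "page_boundary_text": boundary_text(paragraphs, next_paragraphs),
        })
    return {
        "id": chapter_id,
        "doc_type": "chapter",
        "book_id": book_id,  # step 3: plain string field on the chapter
        "_childDocuments_": children,
    }


def search_params(phrase, chapters_per_book=5):
    """Request parameters for steps 4-6 (assumes `phrase` contains no quotes)."""
    page_query = 'page_text:"%s" OR page_boundary_text:"%s"' % (phrase, phrase)
    return {
        # Step 4: match pages, return their parent chapters.
        "q": '{!parent which="doc_type:chapter"}' + page_query,
        # Step 5: group the returned chapters by book.
        "group": "true",
        "group.field": "book_id",
        "group.limit": str(chapters_per_book),
        # Step 6: pull pages back under each chapter. childFilter here only
        # selects pages; narrowing it to the matching pages would need the
        # phrase query quoted for local params, which is not shown.
        "fl": "id,book_id,[child parentFilter=doc_type:chapter "
              "childFilter=doc_type:page limit=50]",
    }
```

The block returned by chapter_block could be posted to the collection's update handler as JSON, and search_params passed as the query string of a select request. Both steps are left out because they depend on the local setup.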