Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
If any of you cares about Stack Overflow reputation and has the time, I have also opened a question there: http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages. Thanks again to everybody. On Wed, 2 Mar 2016 at 09:42, Zaccheo Bagnati

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Alexandre, your solution seems very good: I'll surely try it and let you know. I like the idea of mixing block joins and grouping! On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <arafa...@gmail.com> wrote: > Here is an - untested - possible approach. I might be missing

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Jack, the chapter is definitely the optimal unit to search in, and your solution seems quite a good approach. The drawback is that, depending on how much text we choose to share between two adjacent pages, we will get some errors. For example, it will always be possible to find a

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Emir, a similar solution had already come to my mind too: searching on chapters, highlighting the result and retrieving the matching pages by parsing the highlighted result... surely not a very efficient approach, but it could work... however, I think I'll try different approaches before this. On

Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Walter, the payload idea is something I had never heard of... it seems interesting but quite complex to implement. I think we'll have to write a custom filter to add page numbers, and it's not clear to me how to retrieve payloads in the query result. However, I'll try to dig deeper into

Re: Indexing books, chapters and pages

2016-03-01 Thread Alexandre Rafalovitch
Here is an - untested - possible approach. I might be missing something by combining these things in too many layers, but: 1) Have chapters as parent documents and pages as children within them. Block index them together. 2) On pages, include the page text (probably not stored) as one field. Also

Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
The chapter seems like the optimal unit for initial searches - just combine the page text with a line break between them or index as a multivalued field and set the position increment gap to be 1 so that phrases work. You could have a separate collection for pages, with each page as a Solr

Re: Indexing books, chapters and pages

2016-03-01 Thread Emir Arnautovic
Hi, Off the top of my head - this probably does not solve the problem completely, but it may trigger brainstorming: Index chapters and include page break tokens. Use highlighting to return matches and make sure the fragment size is large enough to include the page break token. In such a scenario you should use slop
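A minimal sketch of the page-break-token idea, for illustration only. The marker format `pagebreak_<n>`, the function names and the sample text are assumptions, not part of the thread; the `<em>` tags mirror Solr's default highlighting markup. It assumes the highlighted text returned for a chapter is long enough to contain the marker that precedes each hit.

```python
import re

# Hypothetical marker inserted before the text of each page at index time.
TOKEN = re.compile(r"pagebreak_(\d+)|<em>.*?</em>")

def build_chapter_text(pages):
    """pages: list of (page_no, text). Prefix each page with its marker."""
    return " ".join(f"pagebreak_{no} {text}" for no, text in pages)

def pages_with_hits(highlighted):
    """Return the sorted page numbers that contain at least one <em> hit."""
    found = set()
    current = None  # page number of the last marker seen so far
    for m in TOKEN.finditer(highlighted):
        if m.group(1) is not None:
            current = int(m.group(1))      # a page marker: switch page
        elif current is not None:
            found.add(current)             # a hit: credit the current page
    return sorted(found)

chapter = build_chapter_text([
    (10, "the sea was calm"),
    (11, "a white whale rose"),
    (12, "the whale dived"),
])
# Simulate what a highlighter would return for the term "whale".
highlighted = chapter.replace("whale", "<em>whale</em>")
print(pages_with_hits(highlighted))
```

A hit that appears before any marker in the fragment is ignored here, which is why the fragment size matters in this approach.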

Re: Indexing books, chapters and pages

2016-03-01 Thread Walter Underwood
You could index both pages and chapters, with a type field. You could index by chapter with the page number as a payload for each token. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati
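A rough sketch of how the payload suggestion could be prepared on the indexing side, under stated assumptions: the chapter text is rewritten as `token|page` pairs, which is the input shape expected by a delimited payload token filter (Solr ships `DelimitedPayloadTokenFilterFactory`, whose default delimiter is `|`). The function name and sample data are hypothetical, the field would need a whitespace-style tokenizer so the delimiter survives, and reading the payloads back at query time is not covered here.

```python
def with_page_payloads(pages, delimiter="|"):
    """pages: list of (page_no, text).

    Return the chapter text with every whitespace-separated token
    suffixed by the number of the page it appears on.
    """
    out = []
    for page_no, text in pages:
        out.extend(f"{token}{delimiter}{page_no}" for token in text.split())
    return " ".join(out)

chapter_field = with_page_payloads([(7, "white whale"), (8, "rose")])
print(chapter_field)
```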

Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Thank you, Jack, for your answer. There are 2 reasons: 1. the requirement is to show in the result list both books and chapters grouped, so I would have to execute the query grouping by book, retrieve the first, let's say, 10 books (sorted by relevance) and then for each book repeat the query grouping

Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
Any reason not to use the simplest structure - each page is one Solr document with a book field, a chapter field, and a page text field? You can then use grouping to group results by book (title text) or even chapter (title text and/or number). Maybe initially group by book and then if the user
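As an illustration of the one-document-per-page structure, the sketch below builds the request parameters for two grouped queries: one grouped by book, and a follow-up restricted to a single book and grouped by chapter. The parameters `group`, `group.field`, `group.limit`, `fq` and `rows` are standard Solr request parameters; the field names `page_text`, `book_id` and `chapter_id` are assumptions, and nothing is sent to a server.

```python
from urllib.parse import urlencode

def books_query(text, rows=10):
    """Top `rows` books for the query, one representative page per book."""
    return {
        "q": f"page_text:({text})",
        "group": "true",
        "group.field": "book_id",   # hypothetical single-valued field
        "group.limit": 1,
        "rows": rows,               # with grouping, rows counts groups
    }

def chapters_query(text, book_id, rows=10, pages_per_chapter=5):
    """Chapters of one book, each with its best matching pages."""
    return {
        "q": f"page_text:({text})",
        "fq": f"book_id:{book_id}",
        "group": "true",
        "group.field": "chapter_id",  # hypothetical single-valued field
        "group.limit": pages_per_chapter,
        "rows": rows,
    }

# Query strings that could be appended to a collection's /select URL.
print(urlencode(books_query("white whale")))
print(urlencode(chapters_query("white whale", "moby")))
```

This is the two-step pattern (books first, then chapters per book) that the thread discusses; it costs one extra query per displayed book.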

Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
The original data is quite well structured: it comes in XML, with chapters and with tags marking the original page breaks of the paper version. This gives us the possibility to restructure it almost as we want before creating the Solr index. On Tue, 1 Mar 2016 at 14:04, Jack Krupansky <

Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
To start, what is the form of your input data - is it already divided into chapters and pages? Or... are you starting with raw PDF files? -- Jack Krupansky On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote: > Hi all, > I'm searching for ideas on how to define schema

Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
That's fine. But how could I obtain, for example, a list of the pages containing a match? On Tue, 1 Mar 2016 at 13:01, Binoy Dalal wrote: > Here's one idea. > Index each chapter as a parent document and then have individual pages to > be the child

Re: Indexing books, chapters and pages

2016-03-01 Thread Binoy Dalal
Here's one idea. Index each chapter as a parent document and then have individual pages to be the child documents. That way for a match in any chapter, you also get the individual pages as documents for presentation. On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati, wrote: > Hi
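A minimal sketch of the parent/child layout, assuming Solr's nested-document JSON form (`_childDocuments_`), the block join parent query parser (`{!parent which=...}`) and the `[child]` document transformer. The field names `doc_type`, `book_id`, `page_no` and `page_text`, the id scheme and the helper names are all hypothetical; the code only builds the payloads and does not contact a server.

```python
import json

def chapter_block(book_id, chapter_no, title, pages):
    """Build one chapter (parent) document with its pages as children.

    pages: list of (page_no, text).
    """
    chapter_id = f"{book_id}-c{chapter_no}"
    return {
        "id": chapter_id,
        "doc_type": "chapter",
        "book_id": book_id,
        "chapter_title": title,
        "_childDocuments_": [
            {
                "id": f"{chapter_id}-p{page_no}",
                "doc_type": "page",
                "page_no": page_no,
                "page_text": text,
            }
            for page_no, text in pages
        ],
    }

def chapters_matching(term):
    """Return chapters having a matching page, with those pages attached."""
    return {
        "q": f'{{!parent which="doc_type:chapter"}}page_text:{term}',
        "fl": f"*,[child parentFilter=doc_type:chapter childFilter=page_text:{term}]",
    }

block = chapter_block("moby", 1, "Loomings",
                      [(1, "Call me Ishmael"), (2, "some years ago")])
print(json.dumps([block], indent=2))   # body for a JSON update request
print(chapters_matching("whale"))
```

The `childFilter` in the transformer is one possible way to get the list of matching pages per chapter: only the children that match the term are returned with each parent.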

Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Hi all, I'm searching for ideas on how to define the schema and how to perform queries in this use case: we have to index books, each book is split into chapters and chapters are split into pages (pages represent the original page breaks in the printed version). We should show the result grouped by books and