Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
If any of you care about your Stack Overflow reputation and have time to
answer, I also opened a question there:
http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages
Thanks again, everybody.


Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Alexandre,
your solution seems very good: I'll certainly try it and let you know. I like
the idea of mixing block joins and grouping!


Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Jack,
the chapter is definitely the best unit to search, and your solution
seems a good approach. The downside is that, depending on how much text
we share between two adjacent pages, we will see some errors. For example,
it will always be possible to find a matching chapter without finding any
matching page (because the searched terms are too far apart). Let's see
whether this is tolerable.



Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Emir,
a similar solution had already come to my mind too: searching on chapters,
highlighting the result and retrieving the matching pages by parsing the
highlighted result... surely not a very efficient approach, but it could work.
However, I think I'll try other approaches before this one.



Re: Indexing books, chapters and pages

2016-03-02 Thread Zaccheo Bagnati
Thanks Walter,
the payload idea is something I had never heard of... it seems interesting
but quite complex to implement. I think we'll have to write a custom filter
to add page numbers, and it's not clear to me how to retrieve payloads in
the query result. However, I'll try to look into this in more depth.
Any further details on how to use payloads?



Re: Indexing books, chapters and pages

2016-03-01 Thread Alexandre Rafalovitch
Here is an - untested - possible approach. I might be missing
something by combining these things in too many layers, but.

1) Have chapters as parent documents and pages as children within them.
Block index them together.
2) On pages, include the page text (probably not stored) as one field.
Also include a second field that has the last paragraph of that page as
well as the first paragraph of the next page. This gives you phrase
matches across boundaries. Also include pageId, etc.
3) On chapters, include the book id as a string field.
4) Use block join query to search against pages, but return (parent)
chapters 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
5) Use grouping or collapsing+expanding by book id to group chapters
within a book: https://cwiki.apache.org/confluence/display/solr/Result+Grouping
or https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
6) Use [child] DocumentTransformer to get pages back with childFilter
to re-limit them by your query:
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
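
A minimal sketch of the document shape described in steps 1) to 3), written in
Python. It only builds the nested JSON; it does not talk to Solr. The field
names (type, book_id, page_text, page_boundary_text) are assumptions made for
illustration and are not taken from the thread. The `_childDocuments_` key is
the one Solr's JSON update format uses for nested child documents.

```python
import json


def build_chapter_block(book_id, chapter_id, title, pages):
    """Return one parent (chapter) document with its pages nested as children.

    `pages` is a list of (page_id, paragraphs) tuples in reading order, where
    `paragraphs` is a non-empty list of strings. All field names here are
    hypothetical.
    """
    children = []
    for i, (page_id, paragraphs) in enumerate(pages):
        # Boundary field: last paragraph of this page plus the first
        # paragraph of the next page (if any), so that a phrase split by
        # the page break can still match on this page.
        boundary = [paragraphs[-1]]
        if i + 1 < len(pages):
            boundary.append(pages[i + 1][1][0])
        children.append({
            "id": page_id,
            "type": "page",
            "page_text": " ".join(paragraphs),
            "page_boundary_text": " ".join(boundary),
        })
    return {
        "id": chapter_id,
        "type": "chapter",
        "book_id": book_id,  # step 3: book id as a plain string field
        "title": title,
        "_childDocuments_": children,
    }


if __name__ == "__main__":
    block = build_chapter_block(
        "b1", "b1-c1", "Chapter 1",
        [("b1-c1-p1", ["Intro.", "it was broken"]),
         ("b1-c1-p2", ["sentence here", "The end."])],
    )
    print(json.dumps([block], indent=2))
```

Sending the whole chapter with its pages in one update keeps parent and
children in the same index block, which is what the block join in step 4)
relies on.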

The main question is whether 6) will be able to piggyback on the
output of 5). And, of course, the performance...
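
As a rough, equally untested illustration of how steps 4) to 6) might be
combined in one request, the Python sketch below only assembles the request
parameters. The field names (type, book_id, page_text, page_boundary_text,
title) and the limits are assumptions. The `{!parent}` parser, the
`group.*` parameters and the `[child]` transformer are the Solr features
linked above; whether grouping and `[child]` cooperate is exactly the open
question, and the quoting of the phrase inside childFilter should be verified
against a real Solr instance.

```python
from urllib.parse import urlencode


def build_params(phrase, rows=10):
    """Build request parameters for a single grouped block-join search."""
    # Keep the sketch simple: refuse characters that would need escaping.
    if any(ch in phrase for ch in "\"'\\"):
        raise ValueError("phrase needs escaping, which this sketch does not do")
    # Search both the page text and the (hypothetical) boundary field.
    page_query = 'page_text:"{0}" OR page_boundary_text:"{0}"'.format(phrase)
    return {
        # Step 4: match pages, return their parent chapters.
        "q": '{!parent which="type:chapter" v=$pageq}',
        "pageq": page_query,
        # Step 5: group the returned chapters by book.
        "group": "true",
        "group.field": "book_id",
        "group.limit": "5",
        # Step 6: attach child pages, re-limited by the same page query.
        "fl": "id,title,book_id,[child parentFilter=type:chapter "
              "childFilter='" + page_query + "' limit=20]",
        "rows": str(rows),
    }


if __name__ == "__main__":
    print("/select?" + urlencode(build_params("broken sentence")))
```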

I would love to know if this works, even partially. Either on the
mailing list or directly.

Regards,
   Alex.


Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 2 March 2016 at 00:50, Zaccheo Bagnati  wrote:
> Thank you, Jack, for your answer.
> There are 2 reasons:
> 1. The requirement is to show both books and chapters, grouped, in the
> result list, so I would have to execute the query grouping by book, retrieve
> the first, let's say, 10 books (sorted by relevance) and then, for each book,
> repeat the query grouping by chapter (always ordering by relevance) in
> order to obtain what we need (unfortunately it is not up to me to define the
> requirements... but they do make sense). Unless there exists some Solr
> feature to do this in only one call (and that would be great!).
> 2. Searching on pages will not match phrases that span 2 pages
> (e.g. if the last word of page 1 is "broken" and the first word of page 2 is
> "sentence", searching for "broken sentence" will not match).
> However, if we do not find a better solution, I think that your proposal is
> not so bad... I hope that reason #2 is negligible and that #1
> performs quite fast even though we are multiplying queries.
>
> On Tue, 1 Mar 2016 at 14:28, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
>> Any reason not to use the simplest structure - each page is one Solr
>> document with a book field, a chapter field, and a page text field? You can
>> then use grouping to group results by book (title text) or even chapter
>> (title text and/or number). Maybe initially group by book and then if the
>> user selects a book group you can re-query with the specific book and then
>> group by chapter.
>>
>>
>> -- Jack Krupansky
>>
>> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
>> wrote:
>>
>> > The original data is quite well structured: it comes in XML with chapters
>> > and tags that mark the original page breaks of the paper version. This way
>> > we can restructure it almost as we want before creating the
>> > Solr index.
>> >
>> > On Tue, 1 Mar 2016 at 14:04, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>> >
>> > > To start, what is the form of your input data - is it already divided into
>> > > chapters and pages? Or... are you starting with raw PDF files?
>> > >
>> > >
>> > > -- Jack Krupansky
>> > >

Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
The chapter seems like the optimal unit for initial searches - just combine
the page text with a line break between them or index as a multivalued
field and set the position increment gap to be 1 so that phrases work.

You could have a separate collection for pages, with each page as a Solr
document, but include the last line of text from the previous page and the
first line of text from the next page so that phrases will match across
page boundaries. Unfortunately, that may also result in false hits if the
full phrase is found on the two adopted lines. That would require some
special filtering to eliminate those false positives.
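
One possible shape for that special filtering, as a Python sketch that runs on
the client after Solr returns a page hit. It assumes the page's own text, the
adopted last line of the previous page and the adopted first line of the next
page are all available separately (for example as stored fields), and it uses
naive lowercase whitespace tokenization, which will not match Solr's analysis
exactly. The function names are made up for this example.

```python
def tokens(text):
    """Very rough stand-in for the index-time analyzer."""
    return text.lower().split()


def find_phrase(haystack, needle):
    """Yield start indexes where token list `needle` occurs in `haystack`."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            yield i


def page_really_matches(prev_last_line, own_text, next_first_line, phrase):
    """True if the phrase occurs and touches at least one token of this page.

    A hit that lies entirely inside an adopted line belongs to the
    neighbouring page and is treated as a false positive here.
    """
    prev_t, own_t, next_t = tokens(prev_last_line), tokens(own_text), tokens(next_first_line)
    combined = prev_t + own_t + next_t
    own_start = len(prev_t)
    own_end = own_start + len(own_t)  # exclusive
    needle = tokens(phrase)
    if not needle:
        return False
    for start in find_phrase(combined, needle):
        end = start + len(needle)  # exclusive
        if start < own_end and end > own_start:
            return True
    return False


if __name__ == "__main__":
    print(page_really_matches("it was a", "broken sentence here", "and more", "a broken"))
    print(page_really_matches("it was a", "broken sentence here", "and more", "it was"))
```

With this rule a phrase that straddles a page break is reported for both
adjacent pages, which may or may not be what the client wants.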

There is also the question of maximum phrase size - most phrases tend to be
reasonably short, but sometimes people may want to search for an entire
paragraph (e.g., a quote) that may span multiple lines on two adjacent
pages.

-- Jack Krupansky



Re: Indexing books, chapters and pages

2016-03-01 Thread Emir Arnautovic

Hi,
Off the top of my head - this probably does not solve the problem completely,
but it may trigger brainstorming: index chapters and include page-break
tokens. Use highlighting to return matches, and make sure the fragment size
is large enough to include a page-break token. In such a scenario you should
use slop for phrase searches...
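
A small Python sketch of the client-side half of this idea: mapping a matched
character span in the chapter text back to page numbers by looking at the
page-break markers. The marker format `[[page:N]]` (meaning "page N starts
here") is invented for this example, and how the match offsets are obtained
(for instance by locating the highlighted fragment in the stored chapter text)
is left out.

```python
import re

# Hypothetical marker inserted at index time where page N begins.
PAGE_BREAK = re.compile(r"\[\[page:(\d+)\]\]")


def pages_for_span(chapter_text, start, end, first_page):
    """Return the page numbers touched by chapter_text[start:end].

    Text before the first marker is assumed to belong to `first_page`.
    """
    # (offset where the page starts, page number), in document order.
    starts = [(0, first_page)]
    starts += [(m.start(), int(m.group(1))) for m in PAGE_BREAK.finditer(chapter_text)]
    touched = []
    for i, (offset, page) in enumerate(starts):
        next_offset = starts[i + 1][0] if i + 1 < len(starts) else len(chapter_text)
        # Half-open interval overlap test between the page and the span.
        if start < next_offset and end > offset:
            touched.append(page)
    return touched


if __name__ == "__main__":
    text = "it was broken [[page:2]] sentence here"
    s = text.index("broken")
    e = text.index("sentence") + len("sentence")
    print(pages_for_span(text, s, e, first_page=1))
```

Because the marker sits between "broken" and "sentence" as an extra token, an
exact phrase query would not match across it, which is where the slop
mentioned above comes in.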


The more I write about it, the less I like it, but I will not delete it...

Regards,
Emir

On 01.03.2016 12:56, Zaccheo Bagnati wrote:

Hi all,
I'm looking for ideas on how to define the schema and how to perform queries
for this use case: we have to index books; each book is split into chapters,
and chapters are split into pages (pages reflect the original page breaks in
the printed version). We should show the results grouped by book, then by
chapter (within the same book) and by page (within the same chapter). As far
as I know, we have 2 options:

1. Index pages as Solr documents. This way we could theoretically
retrieve chapters (and books?) using grouping, but
 a. we will miss matches across two contiguous pages (page breaks are
due only to typographical needs, so concepts could be split... as in printed
books)
 b. I don't know whether it is possible in Solr to group results on two
different levels (books and chapters)

2. Index chapters as Solr documents. In this case we will have the right
matches, but how do we obtain the matching pages? (We need pages because the
client can only display pages.)

We have been struggling with this problem for a long time and have not been
able to find a suitable solution, so I'm asking whether someone has ideas or
has already solved a similar issue.
Thanks



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Indexing books, chapters and pages

2016-03-01 Thread Walter Underwood
You could index both pages and chapters, with a type field.

You could index by chapter with the page number as a payload for each token.
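
To make the payload idea a little more concrete, here is a Python sketch of
the pre-processing side only: it rewrites a chapter so that every token
carries its page number after a delimiter. This is the "token|payload" input
shape that Solr's DelimitedPayloadTokenFilterFactory reads; using that filter
here, the "|" delimiter and the whitespace tokenization are assumptions about
the setup, not something specified in this thread. Reading the payloads back
at query time is not covered.

```python
def with_page_payloads(pages, delimiter="|"):
    """Join pages into one chapter string with a page-number payload per token.

    `pages` is a list of (page_number, text) tuples in reading order.
    """
    out = []
    for page_number, text in pages:
        for token in text.split():
            if delimiter in token:
                # The delimiter would be misread as a payload separator.
                raise ValueError("token contains the payload delimiter: %r" % token)
            out.append("%s%s%d" % (token, delimiter, page_number))
    return " ".join(out)


if __name__ == "__main__":
    chapter = [(1, "it was broken"), (2, "sentence here")]
    print(with_page_payloads(chapter))
```

Since the tokens of adjacent pages end up in consecutive positions of the same
field, a phrase such as "broken sentence" can match across the page break
while each token still records the page it came from.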

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)





Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
Thank you, Jack, for your answer.
There are 2 reasons:
1. the requirement is to show in the result list both books and chapters
grouped, so I would have to execute the query grouping by book, retrieve
the first, let's say, 10 books (sorted by relevance) and then for each book
repeat the query grouping by chapter (again ordering by relevance) in
order to obtain what we need (unfortunately it is not up to me to define the
requirements... but they do make sense). Unless there exists some Solr
feature to do this in only one call (and that would be great!).
2. searching on pages will not match phrases that span 2 pages
(e.g. if the last word of page 1 is "broken" and the first word of page 2 is
"sentence", searching for "broken sentence" will not match).
However, if we do not find a better solution I think that your proposal is
not so bad... I hope that reason #2 turns out to be negligible and that #1
performs fast enough even though we are multiplying queries.
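
To make reason #1 concrete, here is a minimal sketch of the two request shapes involved, assuming one Solr document per page with hypothetical `book_id` and `chapter_id` string fields. The parameters `group`, `group.field`, `group.limit`, `fq` and `rows` are standard Solr request parameters; the function names and defaults are only illustrative.

```python
from urllib.parse import urlencode


def books_query(user_query, rows=10):
    """First call: top books, one group per book, sorted by relevance."""
    return {
        "q": user_query,
        "group": "true",
        "group.field": "book_id",   # hypothetical field name
        "group.limit": 1,           # one representative page per book
        "rows": rows,               # with grouping, rows counts groups
    }


def chapters_query(user_query, book_id, rows=10, pages_per_chapter=5):
    """Follow-up call, repeated once per book returned by books_query."""
    return {
        "q": user_query,
        "fq": f'book_id:"{book_id}"',
        "group": "true",
        "group.field": "chapter_id",  # hypothetical field name
        "group.limit": pages_per_chapter,
        "rows": rows,
    }


print(urlencode(books_query('"broken sentence"')))
print(urlencode(chapters_query('"broken sentence"', "b1")))
```

With 10 books per result page this means 1 + 10 requests per search, which is the multiplication of queries mentioned above.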

On Tue, Mar 1, 2016 at 14:28, Jack Krupansky <
jack.krupan...@gmail.com> wrote:

> Any reason not to use the simplest structure - each page is one Solr
> document with a book field, a chapter field, and a page text field? You can
> then use grouping to group results by book (title text) or even chapter
> (title text and/or number). Maybe initially group by book and then if the
> user selects a book group you can re-query with the specific book and then
> group by chapter.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati 
> wrote:
>
> > Original data is quite well structured: it comes in XML with chapters and
> > tags to mark the original page breaks on the paper version. In this way we
> > have the possibility to restructure it almost as we want before creating
> > SOLR index.
> >
> > On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
> > jack.krupan...@gmail.com> wrote:
> >
> > > To start, what is the form of your input data - is it already divided into
> > > chapters and pages? Or... are you starting with raw PDF files?
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote:
> > >
> > > > Hi all,
> > > > I'm searching for ideas on how to define schema and how to perform queries
> > > > in this use case: we have to index books, each book is split into chapters
> > > > and chapters are split into pages (pages represent original page cutting in
> > > > printed version). We should show the result grouped by books and chapters
> > > > (for the same book) and pages (for the same chapter). As far as I know, we
> > > > have 2 options:
> > > >
> > > > 1. index pages as SOLR documents. In this way we could theoretically
> > > > retrieve chapters (and books?) using grouping but
> > > > a. we will miss matches across two contiguous pages (page cutting is
> > > > only due to typographical needs so concepts could be split... as in printed
> > > > books)
> > > > b. I don't know if it is possible in SOLR to group results on two
> > > > different levels (books and chapters)
> > > >
> > > > 2. index chapters as SOLR documents. In this case we will have the right
> > > > matches but how to obtain the matching pages? (we need pages because the
> > > > client can only display pages)
> > > >
> > > > we have been struggling on this problem for a lot of time and we're not
> > > > able to find a suitable solution so I'm looking if someone has ideas or has
> > > > already solved a similar issue.
> > > > Thanks
> > > >
> > >
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
Any reason not to use the simplest structure - each page is one Solr
document with a book field, a chapter field, and a page text field? You can
then use grouping to group results by book (title text) or even chapter
(title text and/or number). Maybe initially group by book and then if the
user selects a book group you can re-query with the specific book and then
group by chapter.


-- Jack Krupansky

On Tue, Mar 1, 2016 at 8:08 AM, Zaccheo Bagnati  wrote:

> Original data is quite well structured: it comes in XML with chapters and
> tags to mark the original page breaks on the paper version. In this way we
> have the possibility to restructure it almost as we want before creating
> SOLR index.
>
> On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
> > To start, what is the form of your input data - is it already divided into
> > chapters and pages? Or... are you starting with raw PDF files?
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote:
> >
> > > Hi all,
> > > I'm searching for ideas on how to define schema and how to perform queries
> > > in this use case: we have to index books, each book is split into chapters
> > > and chapters are split into pages (pages represent original page cutting in
> > > printed version). We should show the result grouped by books and chapters
> > > (for the same book) and pages (for the same chapter). As far as I know, we
> > > have 2 options:
> > >
> > > 1. index pages as SOLR documents. In this way we could theoretically
> > > retrieve chapters (and books?) using grouping but
> > > a. we will miss matches across two contiguous pages (page cutting is
> > > only due to typographical needs so concepts could be split... as in printed
> > > books)
> > > b. I don't know if it is possible in SOLR to group results on two
> > > different levels (books and chapters)
> > >
> > > 2. index chapters as SOLR documents. In this case we will have the right
> > > matches but how to obtain the matching pages? (we need pages because the
> > > client can only display pages)
> > >
> > > we have been struggling on this problem for a lot of time and we're not
> > > able to find a suitable solution so I'm looking if someone has ideas or has
> > > already solved a similar issue.
> > > Thanks
> > >
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
The original data is quite well structured: it comes in XML with chapters and
tags that mark the original page breaks of the paper version. This way we
can restructure it almost however we want before creating the
Solr index.
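
As an illustration of that restructuring step, here is a small sketch that turns such XML into one document per page. The element and attribute names (`book`, `chapter`, `pb`, `id`, `n`) and the output field names are assumptions, since the real format is not shown in this thread; it also assumes each chapter holds only plain text and page-break markers, with every page starting at a `pb` tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: <pb n="..."/> marks where a printed page starts.
SAMPLE = """<book id="b1">
  <chapter id="c1"><pb n="1"/>The last word here is broken<pb n="2"/>sentence starts page two</chapter>
</book>"""


def pages_from_book(xml_text):
    """Return one dict per page, carrying its book and chapter ids."""
    book = ET.fromstring(xml_text)
    docs = []
    for chapter in book.iter("chapter"):
        for pb in chapter.iter("pb"):
            # The text of a page is whatever follows its page-break marker.
            text = (pb.tail or "").strip()
            docs.append({
                "id": f'{book.get("id")}/{chapter.get("id")}/{pb.get("n")}',
                "book_id": book.get("id"),
                "chapter_id": chapter.get("id"),
                "page": int(pb.get("n")),
                "text": text,
            })
    return docs


for doc in pages_from_book(SAMPLE):
    print(doc)
```

The same loop could just as well concatenate the page texts to build one document per chapter, so the choice of index structure does not have to be fixed by the source format.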

On Tue, Mar 1, 2016 at 14:04, Jack Krupansky <
jack.krupan...@gmail.com> wrote:

> To start, what is the form of your input data - is it already divided into
> chapters and pages? Or... are you starting with raw PDF files?
>
>
> -- Jack Krupansky
>
> On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati 
> wrote:
>
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform queries
> > in this use case: we have to index books, each book is split into chapters
> > and chapters are split into pages (pages represent original page cutting in
> > printed version). We should show the result grouped by books and chapters
> > (for the same book) and pages (for the same chapter). As far as I know, we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?) using grouping but
> > a. we will miss matches across two contiguous pages (page cutting is
> > only due to typographical needs so concepts could be split... as in printed
> > books)
> > b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the right
> > matches but how to obtain the matching pages? (we need pages because the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're not
> > able to find a suitable solution so I'm looking if someone has ideas or has
> > already solved a similar issue.
> > Thanks
> >
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Jack Krupansky
To start, what is the form of your input data - is it already divided into
chapters and pages? Or... are you starting with raw PDF files?


-- Jack Krupansky

On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati  wrote:

> Hi all,
> I'm searching for ideas on how to define schema and how to perform queries
> in this use case: we have to index books, each book is split into chapters
> and chapters are split into pages (pages represent original page cutting in
> printed version). We should show the result grouped by books and chapters
> (for the same book) and pages (for the same chapter). As far as I know, we
> have 2 options:
>
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
> a. we will miss matches across two contiguous pages (page cutting is
> only due to typographical needs so concepts could be split... as in printed
> books)
> b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
>
> 2. index chapters as SOLR documents. In this case we will have the right
> matches but how to obtain the matching pages? (we need pages because the
> client can only display pages)
>
> we have been struggling on this problem for a lot of time and we're  not
> able to find a suitable solution so I'm looking if someone has ideas or has
> already solved a similar issue.
> Thanks
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Zaccheo Bagnati
That's fine. But how could I obtain, for example, a list of the pages
containing a match?

On Tue, Mar 1, 2016 at 13:01, Binoy Dalal wrote:

> Here's one idea.
> Index each chapter as a parent document and then have individual pages to
> be the child documents.
> That way for a match in any chapter, you also get the individual pages as
> documents for presentation.
>
> On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati,  wrote:
>
> > Hi all,
> > I'm searching for ideas on how to define schema and how to perform queries
> > in this use case: we have to index books, each book is split into chapters
> > and chapters are split into pages (pages represent original page cutting in
> > printed version). We should show the result grouped by books and chapters
> > (for the same book) and pages (for the same chapter). As far as I know, we
> > have 2 options:
> >
> > 1. index pages as SOLR documents. In this way we could theoretically
> > retrieve chapters (and books?) using grouping but
> > a. we will miss matches across two contiguous pages (page cutting is
> > only due to typographical needs so concepts could be split... as in printed
> > books)
> > b. I don't know if it is possible in SOLR to group results on two
> > different levels (books and chapters)
> >
> > 2. index chapters as SOLR documents. In this case we will have the right
> > matches but how to obtain the matching pages? (we need pages because the
> > client can only display pages)
> >
> > we have been struggling on this problem for a lot of time and we're not
> > able to find a suitable solution so I'm looking if someone has ideas or has
> > already solved a similar issue.
> > Thanks
> >
> --
> Regards,
> Binoy Dalal
>


Re: Indexing books, chapters and pages

2016-03-01 Thread Binoy Dalal
Here's one idea.
Index each chapter as a parent document and have the individual pages
as its child documents.
That way, for a match in any chapter, you also get the individual pages as
documents for presentation.
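
A minimal sketch of what such a block could look like, assuming Solr's JSON update format with the `_childDocuments_` key and a hypothetical `type` field to tell chapters from pages. The field names and helper functions are illustrative only; the `{!parent}` block join parser and the `[child]` document transformer are existing Solr features, but this combination is untested here.

```python
import json


def chapter_block(book_id, chapter_id, title, pages):
    """Build one parent (chapter) document with its pages nested as children."""
    return {
        "id": f"{book_id}/{chapter_id}",
        "type": "chapter",          # hypothetical discriminator field
        "book_id": book_id,
        "title": title,
        "_childDocuments_": [
            {
                "id": f"{book_id}/{chapter_id}/{number}",
                "type": "page",
                "page": number,
                "text": text,
            }
            for number, text in pages
        ],
    }


def chapters_matching_pages(page_query):
    """Block join query: search the pages, return their parent chapters."""
    return f'{{!parent which="type:chapter"}}{page_query}'


block = chapter_block("b1", "c1", "Introduction", [(1, "first page"), (2, "second page")])
print(json.dumps([block], indent=2))

params = {
    "q": chapters_matching_pages("text:page"),
    # Ask Solr to attach the child pages to each returned chapter.
    "fl": "*,[child parentFilter=type:chapter]",
}
print(params)
```

Parent and children have to be sent together in one update so that Solr stores them as a contiguous block; updating a single page means re-sending the whole chapter block.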

On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati,  wrote:

> Hi all,
> I'm searching for ideas on how to define schema and how to perform queries
> in this use case: we have to index books, each book is split into chapters
> and chapters are split into pages (pages represent original page cutting in
> printed version). We should show the result grouped by books and chapters
> (for the same book) and pages (for the same chapter). As far as I know, we
> have 2 options:
>
> 1. index pages as SOLR documents. In this way we could theoretically
> retrieve chapters (and books?)  using grouping but
> a. we will miss matches across two contiguous pages (page cutting is
> only due to typographical needs so concepts could be split... as in printed
> books)
> b. I don't know if it is possible in SOLR to group results on two
> different levels (books and chapters)
>
> 2. index chapters as SOLR documents. In this case we will have the right
> matches but how to obtain the matching pages? (we need pages because the
> client can only display pages)
>
> we have been struggling on this problem for a lot of time and we're  not
> able to find a suitable solution so I'm looking if someone has ideas or has
> already solved a similar issue.
> Thanks
>
-- 
Regards,
Binoy Dalal