If any of you cares about Stack Overflow reputation and has the time, I
also opened a question there:
http://stackoverflow.com/questions/35722672/solr-schema-to-model-books-chapters-and-pages.
Thanks again to everybody
On Wed, 2 Mar 2016 at 09:42, Zaccheo Bagnati
Thanks Alexandre,
your solution seems very good: I'll certainly try it and let you know. I like
the idea of mixing block joins and grouping!
On Wed, 2 Mar 2016 at 04:46, Alexandre Rafalovitch <
arafa...@gmail.com> wrote:
> Here is an - untested - possible approach. I might be missing
Thanks Jack,
the chapter is definitely the optimal unit to search in, and your solution
seems quite a good approach. The drawback is that, depending on how much
text we choose to share between two adjacent pages, we will
experience some errors. For example, it will always be possible to find a
Thanks Emir,
a similar solution had already come to my mind too: searching on chapters,
highlighting the result and retrieving the matching pages by parsing the
highlighted result... surely not a very efficient approach, but it could work...
However, I think I'll try different approaches before this
On
Thanks Walter,
the payload idea is something that I had never heard of... it seems interesting
but quite complex to implement. I think we'll have to write a custom filter
to add page numbers, and it's not clear to me how to retrieve payloads in
the query result. However, I'll try to dig deeper into
Here is an - untested - possible approach. I might be missing
something by combining these things in too many layers, but.
1) Have chapter as parent documents and pages as children within that.
Block index them together.
2) On pages, include page text (probably not stored) as one field.
Also
The chapter seems like the optimal unit for initial searches - just combine
the page text with a line break between them or index as a multivalued
field and set the position increment gap to be 1 so that phrases work.
You could have a separate collection for pages, with each page as a Solr
Hi,
Off the top of my head - this probably does not solve the problem completely,
but may trigger brainstorming: index chapters and include page break
tokens. Use highlighting to return matches and make sure the fragment size
is large enough to get the page break token. In such a scenario you should use
slop
You could index both pages and chapters, with a type field.
You could index by chapter with the page number as a payload for each token.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
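To make the payload suggestion above more concrete, here is a minimal, untested-against-Solr sketch in plain Python. It does not use any Solr or Lucene API: it only simulates the idea of tagging every token of a chapter with its page number and reading those tags back for a phrase match. The function names and the sample data are hypothetical.

```python
def tokens_with_pages(pages):
    """Flatten a chapter into (token, page_no) pairs, one pair per token.

    `pages` is a list of (page_no, text) tuples in reading order; the page
    number plays the role that a per-token payload would play in the index.
    """
    return [(tok.lower(), page_no)
            for page_no, text in pages
            for tok in text.split()]


def pages_for_phrase(pages, phrase):
    """Return, for each occurrence of `phrase`, the sorted pages it touches."""
    words = phrase.lower().split()
    if not words:
        return []
    stream = tokens_with_pages(pages)
    hits = []
    for i in range(len(stream) - len(words) + 1):
        window = stream[i:i + len(words)]
        if [tok for tok, _ in window] == words:
            # A phrase that crosses a page break reports both pages.
            hits.append(sorted({page for _, page in window}))
    return hits


# Hypothetical chapter made of two printed pages.
chapter = [(12, "call me Ishmael some years"),
           (13, "ago never mind how long")]

across_break = pages_for_phrase(chapter, "years ago")
single_page = pages_for_phrase(chapter, "never mind")
```

The point of the sketch is only that searching at chapter level keeps phrases that span a page break matchable, while the per-token page tag still tells you which pages to show. How to store and read real payloads in Solr is a separate question that this sketch does not answer.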
> On Mar 1, 2016, at 5:50 AM, Zaccheo Bagnati
Thank you, Jack, for your answer.
There are 2 reasons:
1. the requirement is to show both books and chapters, grouped, in the
result list, so I would have to execute the query grouping by book, retrieve
the first, let's say, 10 books (sorted by relevance) and then for each book
repeat the query grouping
Any reason not to use the simplest structure - each page is one Solr
document with a book field, a chapter field, and a page text field? You can
then use grouping to group results by book (title text) or even chapter
(title text and/or number). Maybe initially group by book and then if the
user
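As an illustration of the one-document-per-page structure with grouping described above, here is a small hedged sketch in Python that only builds the query string. The field names `book_id` and `page_text` are assumptions, not something from this thread; `group`, `group.field` and `group.limit` are Solr's standard result grouping parameters.

```python
from urllib.parse import urlencode


def grouped_query(text, group_field="book_id", rows=10, per_group=3):
    """Build a Solr query string that groups matching pages by one field.

    Field names are hypothetical; adapt them to the actual schema.
    """
    params = {
        "q": f"page_text:({text})",
        "group": "true",
        "group.field": group_field,  # one group per book (or per chapter)
        "group.limit": per_group,    # how many pages to return per group
        "rows": rows,                # how many groups to return
        "wt": "json",
    }
    return urlencode(params)


qs = grouped_query("whale")
chapter_qs = grouped_query("whale", group_field="chapter_id")
```

The resulting string would be appended to the collection's `/select` handler URL. Grouping by chapter instead of book is just a different `group.field`, which is why a second request is needed if both levels have to be shown.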
The original data is quite well structured: it comes in XML with chapters and
tags that mark the original page breaks of the paper version. This gives us
the possibility to restructure it almost as we want before creating
the Solr index.
On Tue, 1 Mar 2016 at 14:04, Jack Krupansky <
To start, what is the form of your input data - is it already divided into
chapters and pages? Or... are you starting with raw PDF files?
-- Jack Krupansky
On Tue, Mar 1, 2016 at 6:56 AM, Zaccheo Bagnati wrote:
> Hi all,
> I'm searching for ideas on how to define schema
That's fine. But how could I obtain, for example, a list of the pages
containing a match?
On Tue, 1 Mar 2016 at 13:01, Binoy Dalal
wrote:
> Here's one idea.
> Index each chapter as a parent document and then have individual pages to
> be the child
Here's one idea.
Index each chapter as a parent document and then have individual pages to
be the child documents.
That way for a match in any chapter, you also get the individual pages as
documents for presentation.
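A rough sketch of what such a parent/child block could look like when sent to Solr as JSON, written in Python. The field names (`type`, `book_id`, `page_no`, `page_text`, `chapter_title`) and the id scheme are assumptions made for illustration; `_childDocuments_` is the key Solr's JSON update format uses for nested child documents.

```python
import json


def chapter_block(book_id, chapter_no, title, pages):
    """Build one block: a chapter parent document with its pages as children.

    `pages` is a list of (page_no, text) tuples. All field names here are
    hypothetical and would have to match the real schema.
    """
    chapter_id = f"{book_id}-c{chapter_no}"
    return {
        "id": chapter_id,
        "type": "chapter",
        "book_id": book_id,
        "chapter_title": title,
        "_childDocuments_": [
            {
                "id": f"{chapter_id}-p{page_no}",
                "type": "page",
                "book_id": book_id,
                "page_no": page_no,
                "page_text": text,
            }
            for page_no, text in pages
        ],
    }


block = chapter_block("book1", 1, "Introduction",
                      [(1, "first page text"), (2, "second page text")])
# Body for a JSON update request; parent and children are indexed together.
payload = json.dumps([block], indent=2)
```

With documents shaped like this, a block join query such as `{!parent which="type:chapter"}page_text:foo` should return the chapters that have a matching page, while querying `type:page` documents directly returns the pages themselves.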
On Tue, 1 Mar 2016, 17:26 Zaccheo Bagnati, wrote:
> Hi
Hi all,
I'm looking for ideas on how to define the schema and how to perform queries
in this use case: we have to index books, each book is split into chapters
and chapters are split into pages (pages represent the original page breaks in
the printed version). We should show the result grouped by books and