[dspace-tech] Re: BookReader integration in DSpace + fulltext searching inside the document

Pedro Amorim Mon, 10 Apr 2017 10:15:11 -0700

Hello all,

After a bit of prototyping I decided to go with Solr dynamic fields, 
querying Solr with the fl (field list) parameter so that only the field 
for the requested word is returned.
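
A minimal sketch of the idea, in Python for brevity (this is not DSpace 
code). The word_<term>_ss field naming, the _ss suffix and the JSON-encoded 
values are assumptions made for illustration; search.resourceid is the 
field used in the query quoted further down, and the per-word attributes 
(page, width, height, vpos, hpos) are the ones listed there.

```python
import json
from urllib.parse import urlencode


def word_field(term):
    # Hypothetical naming scheme: one dynamic field per distinct word.
    # Real terms would need sanitizing before being used in a field name.
    return "word_" + term.lower() + "_ss"


def build_doc(item_id, occurrences):
    """Build one Solr document for an item, with one dynamic field per word.

    Each occurrence is a dict with the keys word, page, hpos, vpos,
    width and height. Every field holds a list of JSON strings, one
    per occurrence of that word.
    """
    doc = {"search.resourceid": item_id}
    for occ in occurrences:
        keys = ("page", "hpos", "vpos", "width", "height")
        value = json.dumps({k: occ[k] for k in keys}, sort_keys=True)
        doc.setdefault(word_field(occ["word"]), []).append(value)
    return doc


def build_query(item_id, term):
    # q selects the item; fl limits the response to the requested word's field.
    params = {
        "q": "search.resourceid:" + str(item_id),
        "fl": word_field(term),
        "wt": "json",
    }
    return urlencode(params)
```

The resulting query string would be appended to the select URL of whichever 
Solr core ends up holding these documents. Because fl names a single field, 
the response carries only the occurrences of the searched word.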


However, I first need to override the ImageMagickPdfThumbnailFilter 
media-filter in order to not only create a small thumbnail, but also create 
a large thumbnail for every page in the respective PDF file, and store 
those new JPEG files in a new custom bundle.
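
As a rough sketch of the per-page step (again Python, not the actual media 
filter), the following only builds the ImageMagick command lines, one per 
page, using the 001.jpg, 002.jpg naming convention from my earlier message 
below. The function name, the density value and the assumption that the 
page count is already known are all hypothetical; storing the results in a 
bundle is left out.

```python
def page_commands(pdf_path, page_count, density=150):
    """Return one ImageMagick 'convert' argument list per PDF page.

    ImageMagick selects a single page with the zero-based file.pdf[N]
    syntax; output files are numbered from 001.jpg upwards.
    """
    commands = []
    for index in range(page_count):
        output_name = "%03d.jpg" % (index + 1)
        source = "%s[%d]" % (pdf_path, index)
        commands.append(["convert", "-density", str(density), source, output_name])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is executed here.
    for command in page_commands("book.pdf", 3):
        print(" ".join(command))
```

Each argument list could be passed to subprocess.run, or the same arguments 
could be given to whatever ImageMagick wrapper the filter already uses.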

Having looked at mediafilter/MediaFilterServiceImpl.java, it does not seem 
very flexible (it's not easy to create multiple files out of just one), so 
maybe the best route here would be to override the postProcess method and 
have it create the per-page files after the first small thumbnail is 
completed.

If anyone has implemented post-processing on a media filter before, please 
do advise, as I didn't find much when looking around.

Again, thanks.

Pedro Amorim

On Thursday, 6 April 2017 at 17:23:34 UTC, Pedro Amorim wrote:
>
> Hello everyone,
>
> I'm currently trying to plan an implementation of this and wanted to ask 
> the opinion of the developers on how to go about it.
>
> I have seen the great resources provided by Peter Dietz@LongSight 
> regarding the integration of BookReader 
> <https://github.com/internetarchive/bookreader> in DSpace such as the video 
> demo <https://www.youtube.com/watch?v=mZkvfxPrwZw> and the source code 
> <https://github.com/peterdietz/DSpace/tree/bookreader/dspace-xmlui/src/main/webapp/themes/wheaton-mirage2/vendor/BookReader>
> .
> And all works great, provided that the bitstreams contained in the item 
> follow a specific nomenclature (001.jpg, 002.jpg, 003.jpg, etc) so that the 
> client app can request/render them in the correct order and request page 
> ranges, etc.
>
> However, the feature of searching within the document itself is disabled, 
> because - I believe - this particular feature needs a backend to supply the 
> client app with the needed information.
> This can be seen in production on archive.org, for example by searching 
> for the term *Socrates* within a book 
> <https://archive.org/stream/in.ernet.dli.2015.50197/2015.50197.Plato#page/n105/mode/2up/search/Socrates>
> .
>
> The backend of the Internet Archive's BookReader returns a JSON entry for 
> every hit, for example:
>
> {
>     "text": "fly towards him, nestle in his breast, and then spread its 
> wings and soai upwards, singing most sweetly The next morning Ariston 
> appeared, leading his son Plato to the philosopher, and {{{Socrates}}} knew 
> that his dieam was fulfilled", 
>     "par": [
>         {
>             "boxes": [
>                 {
>                     "r": 694, 
>                     "b": 412, 
>                     "t": 358, 
>                     "page": 10, 
>                     "l": 531
>                 }
>             ], 
>             "b": 463, 
>             "t": 172, 
>             "page_width": 1243, 
>             "r": 1146, 
>             "l": 28, 
>             "page_height": 2123, 
>             "page": 10
>         }
>     ]
> }
>
> This makes sense because with this info the client app can *1)* correctly 
> pinpoint the specific pages where the term is found and *2)* correctly 
> render the highlight box around the searched term within the page being 
> presented using the 'coordinates' and dimensions.
>
> *Assuming:*
> 1) All the required bitstreams are in JPEG format and follow the naming 
> convention mentioned above;
> 2) The required word location information is available in ALTO.xml files 
> (DSpace wouldn't generate that info, it only needs to process and serve it).
>
> *How would one have DSpace act as a backend for the BookReader client app?*
>
> The best approach I've come up with so far is to build a custom 
> media-filter that would parse the word information contained in the 
> ALTO.xml files for each item and store it in a new custom Solr index, 
> which would afterwards be queried by the client app. Every item would 
> have its own word index, with one entry per word occurrence (page, width, 
> height, vpos, hpos), and only the *hits* would be served to the client app.
>
> For example, the following query:
>
> <DSpaceURL>/solr/search/select?q=search.resourceid:<itemID>&word.value=<searchTerm>
>
> would return the information for all the occurrences in the *word* index 
> with the value <searchTerm> (in the example above: Socrates).
> If this could be accomplished, it should work, at least in theory.
>
> Does anyone have other ideas on this? Has anyone implemented something 
> similar, or thought about it before?
>
> Sorry for the wall of text.
>
> Thanks as always,
>
> Pedro Amorim

