Re: [CODE4LIB] Providing Search Across PDFs
Regarding the Google Custom Search solution, I'd add that it sometimes yields odd results: for instance, we indexed a site, and for a given search term Google reports about 16 results (with 10 hits displayed on the page), but when we click through to page 2 it says about 12 results (showing the two remaining hits). Granted, it does say "about", but it's still a bit strange that the system can't compute the proper number of hits up front (this occurs when using label refinements). On the other hand, it's super easy to set up...

On 20/02/2013 20:33, Nathan Tallman wrote:

@Jason and @Michele: I'd rather stay away from a Google solution, the reason being that they don't index everything. Our sitemap is submitted nightly, and out of about 6,000 URLs only 1,500 are indexed. I can't make sure Google indexes the PDFs or be sure that it always will. (If I'm misunderstanding this, please let me know.)

@Péter: The VuFind solution I mentioned is very similar to what you use here. It uses Aperture (although soon Tika instead) to grab the full text and pushes everything into a Solr index. The import is managed through a PHP script that crawls every URL on the sitemap. The only part I don't have is removing deleted, adding new, and updating changed webpages/files. I'm not sure how to rework the script to use a list of new files rather than the sitemap, but everything is on the same server, so that should work.

On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman ntall...@gmail.com wrote:

My institution is looking for ways to provide search across PDFs through our website, specifically PDFs linked from finding aids. Ideally we would search within a collection's PDFs, or possibly across all PDFs linked from all finding aids. We do not have a CMS or a digital repository. A digital repository is on the horizon, but it's a ways out and we need to offer the search sooner. I've looked into Swish-e but haven't had much luck getting anything off the ground.

One way we know we can do this is through our discovery layer, VuFind, using its ability to full-text index a website based on a sitemap (which would include the PDFs linked from finding aids). Facets could be created for collections, and we may be able to create a search box on the finding aid nav that searches just that collection. But I'm not sure how scalable that solution is. The indexing agent cannot discern when a page was updated, so it has to re-scrape everything, every night. The collection driving this will have over 1,000 PDFs, and that's just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project? It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty anyway :-)

Thanks,
Nathan

--
Julien Gibert
Agence Bibliographique de l'Enseignement Supérieur
227, avenue Professeur Jean Louis Viala
34193 Montpellier cedex 5
Tél : 33 (0)4 67 54 84 07
Fax : 33 (0)4 67 54 84 14
Re: [CODE4LIB] Providing Search Across PDFs
On Wed, Feb 20, 2013 at 2:33 PM, Nathan Tallman ntall...@gmail.com wrote:

@Péter: The VuFind solution I mentioned is very similar to what you use here. It uses Aperture (although soon Tika instead) to grab the full text and pushes everything into a Solr index. The import is managed through a PHP script that crawls every URL on the sitemap. The only part I don't have is removing deleted, adding new, and updating changed webpages/files. I'm not sure how to rework the script to use a list of new files rather than the sitemap, but everything is on the same server, so that should work.

Nathan,

A first step could be to record a timestamp of when a particular URL is fetched. Then modify your PHP script to send an If-Modified-Since header with each request. Assuming the target server adheres to basic HTTP behavior, you'll get a 304 response and therefore know you don't have to re-index that particular item. (As an aside, could Google be ignoring items in your sitemap that it thinks haven't changed?)

Maybe I'm misunderstanding, though. The sitemap you mention has links to HTML pages, which then link to the PDFs? So you have to parse the HTML to get the PDF URLs? In that case, it still seems like recording the last-fetched timestamps for the PDF URLs would be an option. I know next to nothing about VuFind, so maybe the fetching mechanism isn't exposed in a way that makes this possible. I'm surprised it's not already baked in, frankly.

One other thing that's confusing is the notion of over 1,000 PDFs taking a long, long time. Even on fairly milquetoast hardware, I'd expect Solr to be capable of extracting and indexing 1,000 PDF documents in 20-30 minutes.

--jay
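If it helps, a minimal PHP/cURL sketch of that idea might look like the following. The lastfetch.json bookkeeping file and the example URL are placeholders made up for illustration, and nothing here is tied to VuFind's own import code:

```php
<?php
// Sketch: skip re-indexing a URL unless it has changed since the last fetch.
// Assumes a local file (lastfetch.json) mapping URL => last fetch time; the
// filename and structure are invented for this example.

$lastFetch = file_exists('lastfetch.json')
    ? json_decode(file_get_contents('lastfetch.json'), true)
    : array();

function needsReindex($url, $lastFetchTime)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);           // a HEAD request is enough here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($lastFetchTime !== null) {
        // cURL adds the If-Modified-Since header for us.
        curl_setopt($ch, CURLOPT_TIMEVALUE, $lastFetchTime);
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
    }
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // 304 Not Modified means the copy we indexed last time is still current.
    return $status !== 304;
}

$url  = 'http://example.org/findingaids/box1.pdf';    // hypothetical URL
$seen = isset($lastFetch[$url]) ? $lastFetch[$url] : null;

if (needsReindex($url, $seen)) {
    // ... fetch the full document and push it to Solr here ...
    $lastFetch[$url] = time();
    file_put_contents('lastfetch.json', json_encode($lastFetch));
}
```

As Jay notes, this only works if the web server actually honors If-Modified-Since for the PDFs; a fallback would be to compare Last-Modified or Content-Length headers yourself before deciding to re-fetch.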
Re: [CODE4LIB] Providing Search Across PDFs
This might not fit your need exactly, but a Google Custom Search (http://www.google.com/cse/) should do the job. You can have the Custom Search index only a given directory, or only PDFs, whichever is more useful.

Jason

On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman ntall...@gmail.com wrote:

My institution is looking for ways to provide search across PDFs through our website, specifically PDFs linked from finding aids. Ideally we would search within a collection's PDFs, or possibly across all PDFs linked from all finding aids. We do not have a CMS or a digital repository. A digital repository is on the horizon, but it's a ways out and we need to offer the search sooner. I've looked into Swish-e but haven't had much luck getting anything off the ground.

One way we know we can do this is through our discovery layer, VuFind, using its ability to full-text index a website based on a sitemap (which would include the PDFs linked from finding aids). Facets could be created for collections, and we may be able to create a search box on the finding aid nav that searches just that collection. But I'm not sure how scalable that solution is. The indexing agent cannot discern when a page was updated, so it has to re-scrape everything, every night. The collection driving this will have over 1,000 PDFs, and that's just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project? It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty anyway :-)

Thanks,
Nathan
Re: [CODE4LIB] Providing Search Across PDFs
What about just a Google site search?

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Nathan Tallman
Sent: Wednesday, February 20, 2013 12:54 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Providing Search Across PDFs

My institution is looking for ways to provide search across PDFs through our website, specifically PDFs linked from finding aids. Ideally we would search within a collection's PDFs, or possibly across all PDFs linked from all finding aids. We do not have a CMS or a digital repository. A digital repository is on the horizon, but it's a ways out and we need to offer the search sooner. I've looked into Swish-e but haven't had much luck getting anything off the ground.

One way we know we can do this is through our discovery layer, VuFind, using its ability to full-text index a website based on a sitemap (which would include the PDFs linked from finding aids). Facets could be created for collections, and we may be able to create a search box on the finding aid nav that searches just that collection. But I'm not sure how scalable that solution is. The indexing agent cannot discern when a page was updated, so it has to re-scrape everything, every night. The collection driving this will have over 1,000 PDFs, and that's just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project? It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty anyway :-)

Thanks,
Nathan
Re: [CODE4LIB] Providing Search Across PDFs
@Jason and @Michele: I'd rather stay away from a Google solution, the reason being that they don't index everything. Our sitemap is submitted nightly, and out of about 6,000 URLs only 1,500 are indexed. I can't make sure Google indexes the PDFs or be sure that it always will. (If I'm misunderstanding this, please let me know.)

@Péter: The VuFind solution I mentioned is very similar to what you use here. It uses Aperture (although soon Tika instead) to grab the full text and pushes everything into a Solr index. The import is managed through a PHP script that crawls every URL on the sitemap. The only part I don't have is removing deleted, adding new, and updating changed webpages/files. I'm not sure how to rework the script to use a list of new files rather than the sitemap, but everything is on the same server, so that should work.

On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman ntall...@gmail.com wrote:

My institution is looking for ways to provide search across PDFs through our website, specifically PDFs linked from finding aids. Ideally we would search within a collection's PDFs, or possibly across all PDFs linked from all finding aids. We do not have a CMS or a digital repository. A digital repository is on the horizon, but it's a ways out and we need to offer the search sooner. I've looked into Swish-e but haven't had much luck getting anything off the ground.

One way we know we can do this is through our discovery layer, VuFind, using its ability to full-text index a website based on a sitemap (which would include the PDFs linked from finding aids). Facets could be created for collections, and we may be able to create a search box on the finding aid nav that searches just that collection. But I'm not sure how scalable that solution is. The indexing agent cannot discern when a page was updated, so it has to re-scrape everything, every night. The collection driving this will have over 1,000 PDFs, and that's just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project? It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty anyway :-)

Thanks,
Nathan
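For what it's worth, the sitemap-to-Solr step can be done with a fairly short standalone script. Below is a rough PHP sketch that posts each PDF from the sitemap to Solr's ExtractingRequestHandler (the same Tika-based extraction VuFind is moving toward); the Solr URL, the "website" core name, and the choice of using the PDF's URL as the document id are assumptions for this example, not how VuFind's importer actually works:

```php
<?php
// Rough sketch: walk a sitemap, fetch each PDF, and post it to Solr's
// ExtractingRequestHandler (Solr Cell), which runs Tika server-side to pull
// out the text. URLs, core name, and field choices are placeholders.

$solrBase = 'http://localhost:8983/solr/website';                  // hypothetical core
$ns       = 'http://www.sitemaps.org/schemas/sitemap/0.9';         // sitemap namespace
$sitemap  = simplexml_load_file('http://example.org/sitemap.xml'); // hypothetical sitemap

foreach ($sitemap->children($ns)->url as $entry) {
    $url = (string) $entry->children($ns)->loc;
    if (substr($url, -4) !== '.pdf') {
        continue;                                   // PDFs only in this sketch
    }

    $pdf = file_get_contents($url);
    if ($pdf === false) {
        continue;                                   // skip anything we can't fetch
    }

    // Send the raw PDF bytes; literal.id stores the URL alongside the extracted text.
    $ch = curl_init($solrBase . '/update/extract?literal.id=' . urlencode($url));
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $pdf);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/pdf'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}

// Commit once at the end rather than per document -- much faster for 1,000+ files.
$ch = curl_init($solrBase . '/update');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, '<commit/>');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
```

Combining this with the If-Modified-Since check sketched earlier would let the nightly run touch only new or changed PDFs instead of re-scraping everything.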
Re: [CODE4LIB] Providing Search Across PDFs
Yes, Google Custom Search is not too bad, if your PDFs are sorted meaningfully by directory and if you submit a site map to Google for more complete indexing. You can use Xenu to make a site map, put the site map online as a static XML file, and then use Google Webmaster Tools to pass along the location of the site map. This helps Google index your site more completely. Then you periodically recreate and update the site map.

For a homegrown search, I would have recommended Swish-e if you hadn't said it was out of reach.

-Wilhelmina Randtke

On Wed, Feb 20, 2013 at 12:07 PM, Jason Griffey grif...@gmail.com wrote:

This might not fit your need exactly, but a Google Custom Search (http://www.google.com/cse/) should do the job. You can have the Custom Search index only a given directory, or only PDFs, whichever is more useful.

Jason

On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman ntall...@gmail.com wrote:

My institution is looking for ways to provide search across PDFs through our website, specifically PDFs linked from finding aids. Ideally we would search within a collection's PDFs, or possibly across all PDFs linked from all finding aids. We do not have a CMS or a digital repository. A digital repository is on the horizon, but it's a ways out and we need to offer the search sooner. I've looked into Swish-e but haven't had much luck getting anything off the ground.

One way we know we can do this is through our discovery layer, VuFind, using its ability to full-text index a website based on a sitemap (which would include the PDFs linked from finding aids). Facets could be created for collections, and we may be able to create a search box on the finding aid nav that searches just that collection. But I'm not sure how scalable that solution is. The indexing agent cannot discern when a page was updated, so it has to re-scrape everything, every night. The collection driving this will have over 1,000 PDFs, and that's just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project? It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty anyway :-)

Thanks,
Nathan
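If the PDFs all sit under one directory, the static site map could also be generated by a small script instead of Xenu and then refreshed from cron. Here is a rough PHP sketch along those lines; the base URL and filesystem path are placeholders, not anything specific to the setup described above:

```php
<?php
// Sketch: build a minimal sitemap.xml listing every PDF under one directory.
// Paths and the base URL are hypothetical -- point them at wherever the
// finding aid PDFs actually live.

$baseUrl = 'http://example.org/findingaids/';   // hypothetical web location of the PDFs
$dir     = '/var/www/findingaids';              // hypothetical filesystem path

$xml = new XMLWriter();
$xml->openURI($dir . '/sitemap.xml');
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach (glob($dir . '/*.pdf') as $file) {
    $xml->startElement('url');
    $xml->writeElement('loc', $baseUrl . rawurlencode(basename($file)));
    // lastmod lets crawlers (and your own scripts) skip unchanged files.
    $xml->writeElement('lastmod', date('Y-m-d', filemtime($file)));
    $xml->endElement();
}

$xml->endElement();   // urlset
$xml->endDocument();
$xml->flush();
```

Once the file is online, the site map location would still be registered in Google Webmaster Tools as described above.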