Re: [CODE4LIB] Providing Search Across PDFs

2013-02-21 Thread Gibert Julien
As for the Google Custom Search solution, I'd add that it sometimes
yields odd results: for instance, we indexed a site and, for a given
search term, Google reports about 16 results (with 10 hits displayed on
the first page), but when we click through to page 2 it reports about 12
results (showing the two remaining hits). Granted, it does say "about",
but it's still a bit strange that the system can't compute the proper
number of hits up front (this happens when using label refinements).

On the other hand, it's super easy to set up...

On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman ntall...@gmail.com wrote:


My institution is looking for ways to provide search across PDFs through
our website. Specifically, PDFs linked from finding aids. Ideally searching
within a collection's PDFs or possibly across all PDFs linked from all
finding aids.

We do not have a CMS or a digital repository. A digital repository is on
the horizon, but it's a ways out and we need to offer the search sooner.
I've looked into Swish-e but haven't had much luck getting anything off the
ground.

One way we know we can do this is through our discovery layer, VuFind,
using its ability to full-text index a website based on a sitemap (which
would include PDFs linked from finding aids). Facets could be created for
collections, and we may be able to create a search box on the finding aid
nav that searches specifically within that collection.

But I'm not sure how scalable that solution is. The indexing agent cannot
discern when a page was updated, so it has to re-scrape everything every
night. The impetus collection is going to have over 1,000 PDFs, and that's
just to start. Creating the index will start to take a long, long time.

Does anyone have any ideas or know of any useful tools for this project?
It doesn't have to be perfect; quick and dirty may work. (The OCR is dirty
anyway. :-)

Thanks,
Nathan







--
*Julien Gibert*
Agence Bibliographique de l'Enseignement Supérieur
227, avenue Professeur Jean Louis Viala
34193 Montpellier cedex 5
Tél : 33 (0)4 67 54 84 07
Fax : 33 (0)4 67 54 84 14


Re: [CODE4LIB] Providing Search Across PDFs

2013-02-21 Thread Jay Luker

Nathan,

A first step could be to record a timestamp of when a particular URL
is fetched. Then modify your PHP script to send an If-Modified-Since
header with the request. Assuming the target server adheres to basic
HTTP behavior, you'll get a 304 response and therefore know you don't
have to re-index that particular item.
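
A minimal sketch of that conditional GET in PHP with cURL (the URL and
the stored timestamp here are just placeholders; the real values would
come from wherever your script keeps its crawl state):

  <?php
  // Minimal sketch of a conditional GET: only re-fetch and re-index a URL
  // if it has changed since the last crawl. $url and $lastFetched are
  // placeholders for values read from your crawl state.
  $url = 'http://example.org/findingaids/box1.pdf';
  $lastFetched = strtotime('2013-02-01 00:00:00');

  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_HTTPHEADER, array(
      'If-Modified-Since: ' . gmdate('D, d M Y H:i:s', $lastFetched) . ' GMT',
  ));

  $body   = curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);

  if ($status == 304) {
      // Unchanged since the last crawl -- skip re-extraction and re-indexing.
      echo "Not modified, skipping: $url\n";
  } elseif ($status == 200) {
      // Changed (or first fetch): hand $body to the extractor/indexer
      // and record the new fetch time for the next run.
      echo "Fetched " . strlen($body) . " bytes from $url\n";
  }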

(As an aside, could Google be ignoring items in your sitemap that it
thinks haven't changed?)

Maybe I'm misunderstanding, though. The sitemap you mention has links
to HTML pages, which then link to the PDFs? So you have to parse the
HTML to get the PDF URLs? In that case, it still seems like recording
the last-fetched timestamps for the PDF URLs would be an option. I
know next to nothing about VuFind, so maybe the fetching mechanism
isn't exposed in a way that makes this possible. I'm surprised it's not
already baked in, frankly.

One other thing that's confusing is the notion of over 1,000 PDFs
taking a long, long time. Even on fairly milquetoast hardware, I'd
expect Solr to be capable of extracting and indexing 1,000 PDF
documents in 20-30 minutes.
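
If it helps, Solr's ExtractingRequestHandler (Solr Cell, which wraps
Tika) will do the PDF text extraction server-side if you just post the
file to it. A rough sketch, assuming a stock Solr instance at
localhost:8983 with /update/extract enabled and PHP 5.5's CURLFile; the
id and collection literals are only illustrative, not a required schema:

  <?php
  // Rough sketch: push one PDF to Solr's ExtractingRequestHandler, which
  // runs Tika server-side and indexes the extracted text.
  $pdf  = '/data/findingaids/mss001/box1.pdf';
  $solr = 'http://localhost:8983/solr/update/extract'
        . '?literal.id=' . urlencode('mss001-box1')
        . '&literal.collection=' . urlencode('mss001')
        . '&commit=true';

  $ch = curl_init($solr);
  curl_setopt($ch, CURLOPT_POST, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  // CURLFile requires PHP >= 5.5; older PHP would use the '@/path' syntax.
  curl_setopt($ch, CURLOPT_POSTFIELDS, array(
      'file' => new CURLFile($pdf, 'application/pdf'),
  ));

  $response = curl_exec($ch);
  $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);

  echo ($status == 200) ? "Indexed $pdf\n" : "Solr returned HTTP $status\n";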

--jay


Re: [CODE4LIB] Providing Search Across PDFs

2013-02-20 Thread Jason Griffey
This might not fit your need exactly, but a Google Custom Search (
http://www.google.com/cse/) should do the job. You can have the Custom
Search only index a given directory, or only PDFs, whichever is more useful.

Jason





Re: [CODE4LIB] Providing Search Across PDFs

2013-02-20 Thread Michele R Combs
What about just a Google site search?



Re: [CODE4LIB] Providing Search Across PDFs

2013-02-20 Thread Nathan Tallman
@Jason and @Michele: I'd rather stay away from a Google solution, the
reason being that they don't index everything. Our sitemap is submitted
nightly, and out of about 6,000 URLs only 1,500 are indexed. I can't make
sure Google indexes the PDFs, or be sure that it always will. (If I'm
misunderstanding this, please let me know.)

@Péter: The VuFind solution I mentioned is very similar to what you use
here. It uses Aperture (although it will soon use Tika instead) to grab
the full text and shoves everything into a Solr index. The import is
managed through a PHP script that crawls every URL in the sitemap. The
only part I don't have is removing deleted, adding new, and updating
changed webpages/files. I'm not sure how to rework the script to use a
list of new files rather than the sitemap, but everything is on the same
server, so that should work.
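
To make that last part concrete, here's a rough sketch of the bookkeeping
I have in mind -- just an assumption about how it could work, not anything
the VuFind importer does today: diff the current sitemap against a saved
list of URLs from the previous run to find additions and deletions, and
re-check everything else before re-indexing.

  <?php
  // Sketch of the bookkeeping described above -- purely illustrative.
  // Compare the URLs in the current sitemap against a saved snapshot from
  // the previous run to find what was added or removed; anything else is a
  // candidate for an If-Modified-Since check before re-indexing. The file
  // paths and the sitemap namespace handling are assumptions.
  $sitemap = simplexml_load_file('http://example.org/sitemap.xml');
  $current = array();
  foreach ($sitemap->url as $entry) {
      $current[] = (string) $entry->loc;
  }

  $snapshotFile = '/var/lib/crawler/known_urls.txt';
  $previous = file_exists($snapshotFile)
      ? file($snapshotFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
      : array();

  $added   = array_diff($current, $previous);      // index for the first time
  $deleted = array_diff($previous, $current);      // remove from the Solr index
  $kept    = array_intersect($current, $previous); // re-check with If-Modified-Since

  // Persist the current list so the next nightly run can diff against it.
  file_put_contents($snapshotFile, implode("\n", $current) . "\n");

  printf("%d new, %d deleted, %d to re-check\n",
      count($added), count($deleted), count($kept));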








Re: [CODE4LIB] Providing Search Across PDFs

2013-02-20 Thread Wilhelmina Randtke
Yes, Google Custom Search is not too bad, if your PDFs are sorted
meaningfully by directory and if you submit a site map to Google for more
complete indexing. You can use Xenu to make a site map, put it online as a
static XML file, and then use Google Webmaster Tools to submit the
location of the site map. This helps Google index your site more
completely. Then you periodically recreate and update the site map.
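
For reference, the site map itself is just a small static XML file in the
sitemaps.org format, roughly like the example below (the URLs are
placeholders); the optional lastmod element is also what lets crawlers
skip files that haven't changed:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://example.org/findingaids/mss001/box1.pdf</loc>
      <lastmod>2013-02-15</lastmod>
    </url>
    <url>
      <loc>http://example.org/findingaids/mss001/box2.pdf</loc>
      <lastmod>2013-02-18</lastmod>
    </url>
  </urlset>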

For homegrown search, I would have recommended Swish-e, if you hadn't said
it was out of reach.

-Wilhelmina Randtke

