Maxus wrote:
> I have a couple thousand PDFs I would like to index. Each PDF has an
> entry in the DB with the file path, but the PDF's content (which can
> be up to a couple of megabytes per PDF) is not stored in the DB, and
> that content is the data that needs to be indexed. Does Thinking
> Sphinx support this setup? I was thinking I would import the PDF data,
> combine it with the DB data, and then send it to Sphinx via xmlpipe2.
> Is there a better way? Or do I need to find a Rails plugin that will
> let me query Sphinx directly rather than going through a model?

Is the "couple of meg" the PDF file size or the size of the extracted
text?

If it's the PDF file size, you might be surprised at how much smaller
the content is once it's converted to plain text. You could add a text
column to your database and use something like pdftotext to save the
plain text content there.

-- James Healy <jimmy-at-deefa-dot-com>  Mon, 06 Jul 2009 13:03:59 +1000

