Hi everyone!

Wow, thanks for so many helpful replies.

Sorry about the typo in the heading; it was meant to be "thinking"
rather than "thanking".

"Thinking Sphinx doesn't support XML piped data at this point in
time
(although I'd like to add it eventually), so I'm afraid your best
option is to look into Sphinx libraries that don't go through
ActiveRecord. Riddle might be useful for that (the Ruby API for
Sphinx
I extracted from an early version of Thinking Sphinx), although
documentation is light. It follows Sphinx's structure pretty
closely
though."

Thanks for that, Pat. I thought that might be the case. You almost need
a way to tie the data into the model by having a field that doesn't
get saved to the database but does get passed to Sphinx, maintaining
the relationship between the two using the document ID. I imagine
that's fairly tricky stuff. I would have a bash at it, but this is
literally my first weekend getting my head around Ruby in general.
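For anyone curious, this is roughly what I was picturing for the xmlpipe2 side: a plain-Ruby sketch that emits Sphinx's xmlpipe2 format, keeping the database id as the Sphinx document id so search results can be mapped back to records. Completely untested, and the `content` field name and the `:id`/`:content` hash keys are just placeholders for whatever the real schema would be:

```ruby
require "cgi"

# Untested sketch: emit Sphinx's xmlpipe2 format for a set of
# documents. Each doc hash stands in for a DB row plus its extracted
# PDF text; the DB id doubles as the Sphinx document id so results
# can be tied back to the record.
def xmlpipe2_stream(docs)
  xml = ""
  xml << "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
  xml << "<sphinx:docset>\n"
  xml << "<sphinx:schema>\n"
  xml << "  <sphinx:field name=\"content\"/>\n"
  xml << "</sphinx:schema>\n"
  docs.each do |doc|
    xml << "<sphinx:document id=\"#{doc[:id]}\">\n"
    # Escape the extracted text so stray < > & in the PDF content
    # can't break the XML stream.
    xml << "  <content>#{CGI.escapeHTML(doc[:content])}</content>\n"
    xml << "</sphinx:document>\n"
  end
  xml << "</sphinx:docset>\n"
end
```

The indexer config would then point an xmlpipe2 source at a script that prints this stream to stdout.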

"Is the "couple of meg" the PDF file size or the size of the
extracted
text?

If it's the PDF file size, you might be surprised at the size of the
text when it's converted to plain text. You could add a text column
to
your database and use something like pdftotext to save the plain text
content."

Interesting idea, James. My main concern was scaling the application;
it might be okay for a few thousand PDFs now, but I imagine it would
grow, and some of the PDFs have 2 or 3 MB of text once extracted. Also,
once the text is added to the index I really have no use for it after
that; the documents won't be updated, so they're effectively read-only.
I might run a test on a couple of thousand documents just to see how it
floats.
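If I do end up going the stored-text route, I guess the extraction step would look something like this. An untested sketch: it assumes the pdftotext binary (from xpdf/poppler) is on the PATH, and the method names are mine:

```ruby
# Build the pdftotext command line. The trailing "-" tells pdftotext
# to write the extracted plain text to stdout instead of creating a
# .txt file next to the PDF.
def pdftotext_command(pdf_path)
  ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]
end

# Shell out to pdftotext and capture the extracted text. Returns nil
# if the conversion fails (e.g. a corrupt or encrypted PDF).
def extract_text(pdf_path)
  text = IO.popen(pdftotext_command(pdf_path)) { |io| io.read }
  $?.success? ? text : nil
end
```

The returned text could then be saved into a text column for Sphinx to index, and dropped again once the index is built if it's never needed for display.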

Thanks for your time, everyone. I'll have a play and see how I go.

Cheers!
Maxus

On Jul 6, 11:08 am, James Healy <[email protected]> wrote:
> Maxus wrote:
> > I have a couple of thousand PDFs I would like to index. Each PDF has
> > an entry in the DB with the file path, but the PDF's content is not
> > stored in the DB (as it can be up to a couple of meg per PDF), and
> > that content is the data that needs to be indexed. Does Thinking
> > Sphinx support this setup? I was thinking I would import the PDF
> > data, combine it with the DB data, and then send it to Sphinx using
> > xmlpipe2, or is there a better way? Or do I need to just find a
> > Rails plugin that will let me query Sphinx directly rather than
> > going through a model?
>
> Is the "couple of meg" the PDF file size or the size of the extracted
> text?
>
> If it's the PDF file size, you might be surprised at the size of the
> text when it's converted to plain text. You could add a text column to
> your database and use something like pdftotext to save the plain text
> content.
>
> -- James Healy <jimmy-at-deefa-dot-com>  Mon, 06 Jul 2009 13:03:59 +1000