Hi everyone! Wow, thanks for so many helpful replies.

Sorry about the typo in the heading; it was meant to be "thinking" rather than "thanking".

> Thinking Sphinx doesn't support XML piped data at this point in time
> (although I'd like to add it eventually), so I'm afraid your best option
> is to look into Sphinx libraries that don't go through ActiveRecord.
> Riddle might be useful for that (the Ruby API for Sphinx I extracted
> from an early version of Thinking Sphinx), although documentation is
> light. It follows Sphinx's structure pretty closely though.

Thanks for that, Pat. I thought that might be the case. You'd almost need a way to tie the data into the model by having a field that doesn't get saved to the database but does get passed to Sphinx, maintaining the relationship between the two via the document ID. I imagine that's fairly tricky stuff; I'd have a bash at it, but this is literally my first weekend getting my head around Ruby in general.

> Is the "couple of meg" the PDF file size or the size of the extracted
> text? If it's the PDF file size, you might be surprised at the size of
> the text when it's converted to plain text. You could add a text column
> to your database and use something like pdftotext to save the plain
> text content.

Interesting idea, James. My main concern was scaling the application: it might be okay for a few thousand PDFs now, but I imagine the collection will grow, and some of the PDFs have 2 or 3 MB of text once extracted. Also, once the text is added to the index I really have no use for it; the documents won't be updated, so they're effectively read-only. I might run a test on a couple of thousand documents just to see how it floats.

Thanks for your time everyone, I'll have a play and see how I go.

Cheers!
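P.S. For anyone finding this thread later, here's a rough sketch of what James is suggesting: shell out to pdftotext, store the result in a text column, and let Thinking Sphinx index that column normally. This is just my sketch, not anything from the Thinking Sphinx API; the method names and the idea of a `content` column are my own assumptions.

```ruby
require 'shellwords'

# Build the pdftotext command for a given PDF path. The trailing "-"
# tells pdftotext to write the extracted text to stdout rather than
# to a .txt file. Shellwords.escape guards against spaces etc. in paths.
def pdftotext_command(pdf_path)
  "pdftotext #{Shellwords.escape(pdf_path)} -"
end

# Extract plain text from a PDF on disk; returns nil if pdftotext
# fails (e.g. the file is missing or not a valid PDF).
def extract_text(pdf_path)
  text = `#{pdftotext_command(pdf_path)}`
  $?.success? ? text : nil
end
```

In a Rails model you'd presumably call something like `extract_text` in a `before_save` callback to populate the text column from the stored file path, then declare that column as an indexed field in the model's Thinking Sphinx index definition.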
Maxus

On Jul 6, 11:08 am, James Healy <[email protected]> wrote:
> Maxus wrote:
> > I have a couple thousand PDFs I would like to index. Each PDF has an
> > entry in the DB with the file path, but the PDF's content is not
> > stored in the DB (as it can be up to a couple of meg per PDF), and
> > that is the data that needs to be indexed. Does Thinking Sphinx
> > support this setup? I was thinking I would import the PDF data,
> > combine it with the DB data, and send it to Sphinx using xmlpipe2,
> > or is there a better way? Or do I need to find a Rails plugin that
> > will let me query Sphinx directly rather than going through a model?
>
> Is the "couple of meg" the PDF file size or the size of the extracted
> text?
>
> If it's the PDF file size, you might be surprised at the size of the
> text when it's converted to plain text. You could add a text column to
> your database and use something like pdftotext to save the plain text
> content.
>
> -- James Healy <jimmy-at-deefa-dot-com> Mon, 06 Jul 2009 13:03:59 +1000

You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
