Re: [PHP] storing searching docs
On Dec 13, 2012 4:50 PM, Jim Giner jim.gi...@albanyhandball.com wrote:

> Thanks for all the posts. After reading and googling all afternoon, I think the best approach for me is:
>
> Create two macros in Word (done!) to export each of my .doc files to .txt and .pdf formats.
>
> Create a sql table to hold the .txt contents of my .doc files, along with a reference to the meeting date and the name of the corresponding .pdf file.
>
> Upload my two sets of files with an ftp client and then use a script to load the table with my .txt file data.

Why not use php to upload the set of files?

> Now I just need a couple of scripts to allow a user to locate a file and bring up the pdf for when he wants to read about a meeting. And a second script to accept user input (search words) and perform a query against the textual data and present some kind of results - probably a listing containing a reference to the meeting date and a tbd-length string showing the matching result for each occurrence, ie, something like n chars in front of and after the match so the user can see the context of the match.
>
> Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in .txt format. (actually, if I 'print' the .doc as a pdf instead of using the Word's File,Save as, the resulting pdf is only 70kb. Might need a new macro!)
>
> Thanks again!

-- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] storing searching docs
On Dec 13, 2012 4:50 PM, Jim Giner jim.gi...@albanyhandball.com wrote:

> Sizes - a 28k .doc file grows to 142kb in .pdf format and is only 5kb in .txt format. (actually, if I 'print' the .doc as a pdf instead of using the Word's File,Save as, the resulting pdf is only 70kb. Might need a new macro!)

PDF might be better looking than this, but how big is an HTML doc exported from Word?
Re: [PHP] storing searching docs
On Dec 15, 2012 7:29 AM, tamouse mailing lists tamouse.li...@gmail.com wrote:

> PDF might be better looking than this, but how big is an HTML doc exported from Word?

Sorry for the disjointed replies, it's still early...

You could export just the HTML, upload it, and have your script strip the HTML so that both formats are available, i.e. plain text for indexing and HTML for presentation. Or even, say, run the HTML through pandoc and produce Markdown.

As I say, it's early and these might be bad ideas, but it's how I'd approach it.
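A minimal sketch of the "strip the HTML" step suggested above, for anyone following the thread. The function name htmlToPlainText is made up for illustration, the code assumes PHP 7 or later, and it assumes the exported HTML has already been converted to UTF-8 (Word often saves as windows-1252). Word's export carries large style blocks and conditional comments, so those are removed before the remaining tags.

```php
<?php
/**
 * Reduce an HTML document to plain text suitable for indexing.
 * Hypothetical helper; assumes the input is UTF-8.
 */
function htmlToPlainText(string $html): string
{
    // Drop blocks whose contents are not document text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;
    // Drop comments, including Word's conditional comments.
    $html = preg_replace('/<!--.*?-->/s', ' ', $html) ?? $html;
    // Replace common block-level tags with a space so neighbouring
    // paragraphs do not fuse into one word.
    $html = preg_replace('#</?(p|div|br|li|tr|td|th|h[1-6])\b[^>]*>#i', ' ', $html) ?? $html;
    // Remove whatever markup is left (inline tags vanish without a space).
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace, including non-breaking spaces.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($text);
}
```

The regular expressions are deliberately simple; a real HTML parser would be more robust against unusual markup.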
Re: [PHP] storing searching docs
On 12/15/2012 8:26 AM, tamouse mailing lists wrote:

> On Dec 13, 2012 4:50 PM, Jim Giner jim.gi...@albanyhandball.com wrote:
>
>> Upload my two sets of files with an ftp client and then use a script to load the table with my .txt file data.
>
> Why not use php to upload the set of files?

Because I don't know how php could do such a thing. The only way I know of is through a 'file' input on an html page, which is a pain since I would have to do it for each file. With an ftp client I can just drag/drop the files in 10 seconds. In the future, as I add additional docs one at a time, I'll have a simple html form for doing that.
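For the "script to load the table" step that follows the FTP upload, here is a rough sketch rather than a finished loader. Everything named here is an assumption: the file naming scheme (YYYY-MM-DD.txt next to YYYY-MM-DD.pdf), the table and column names (minutes, meeting_date, pdf_name, body), and PHP 7 or later with PDO.

```php
<?php
/**
 * Collect one row per .txt file in $dir.
 * ASSUMPTION (hypothetical naming scheme): each file is named after its
 * meeting date, e.g. 2012-12-13.txt, with a matching 2012-12-13.pdf.
 */
function collectMeetingTexts(string $dir): array
{
    $rows = [];
    $files = glob(rtrim($dir, '/') . '/*.txt') ?: [];
    sort($files);
    foreach ($files as $path) {
        $base = basename($path, '.txt');
        // Skip anything that does not look like YYYY-MM-DD.
        if (!preg_match('/^\d{4}-\d{2}-\d{2}$/', $base)) {
            continue;
        }
        $rows[] = [
            'meeting_date' => $base,
            'pdf_name'     => $base . '.pdf',
            'body'         => (string) file_get_contents($path),
        ];
    }
    return $rows;
}

/**
 * Insert the collected rows using one prepared statement.
 * The table and column names are hypothetical.
 */
function loadMeetingTexts(PDO $db, array $rows): int
{
    $stmt = $db->prepare(
        'INSERT INTO minutes (meeting_date, pdf_name, body)
         VALUES (:meeting_date, :pdf_name, :body)'
    );
    foreach ($rows as $row) {
        $stmt->execute([
            ':meeting_date' => $row['meeting_date'],
            ':pdf_name'     => $row['pdf_name'],
            ':body'         => $row['body'],
        ]);
    }
    return count($rows);
}
```

Splitting the work in two keeps the directory scan testable without a database; the prepared statement takes care of quoting the document text.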
Re: [PHP] storing searching docs
On 12/15/2012 8:29 AM, tamouse mailing lists wrote:

> PDF might be better looking than this, but how big is an HTML doc exported from Word?

Word generates very many, many words (!) when creating an html doc. Not a good html generator at all.
Re: [PHP] storing searching docs
On Sat, 2012-12-15 at 12:21 -0500, Jim Giner wrote:

> The only way I know of is thru a 'file' input on an html page which is a pia since I would have to do it for each file. With an ftp client I can just drag/drop the files in 10 seconds.

I believe Chrome supports drag and drop for file inputs now. I do know that Chrome and Firefox support multiple uploads from one form element without the need for things like Uploadify.

Thanks,
Ash
http://www.ashleysheridan.co.uk
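To illustrate the multiple-upload point: with a form field such as `<input type="file" name="docs[]" multiple>`, PHP delivers the files in $_FILES grouped by attribute (all names, all temp paths, all error codes) rather than per file. The sketch below regroups them. The field name "docs", the helper name and the target directory are assumptions for illustration, and the code assumes PHP 7 or later.

```php
<?php
/**
 * Turn PHP's attribute-grouped $_FILES entry for a multi-file input
 * into one array per successfully uploaded file.
 */
function normalizeUploads(array $field): array
{
    $files = [];
    foreach ($field['name'] as $i => $name) {
        if ($field['error'][$i] !== UPLOAD_ERR_OK) {
            continue; // skip empty slots and failed uploads
        }
        $files[] = [
            // basename() discards any directory part sent by the client.
            'name'     => basename($name),
            'tmp_name' => $field['tmp_name'][$i],
            'size'     => $field['size'][$i],
        ];
    }
    return $files;
}

// In the form handler (field name and target directory are hypothetical):
// foreach (normalizeUploads($_FILES['docs']) as $file) {
//     move_uploaded_file($file['tmp_name'], '/path/to/minutes/' . $file['name']);
// }
```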
Re: [PHP] storing searching docs
On Sat, Dec 15, 2012 at 11:21 AM, Jim Giner jim.gi...@albanyhandball.com wrote:

> With an ftp client I can just drag/drop the files in 10 seconds. In the future, as I add additional docs, one at a time, I'll have a simple html form for doing that.

Yeah, bulk upload is a bigger problem. I was thinking just the one-at-a-time thing.
Re: [PHP] storing searching docs
On Sat, Dec 15, 2012 at 11:22 AM, Jim Giner jim.gi...@albanyhandball.com wrote:

> Word generates very many many words (!) when creating an html doc. Not a good html generator at all.

I think my next email talked about sending the HTML through pandoc to make a plain text file, perhaps in Markdown, which could be the thing you save. You could then run it through a Markdown filter to produce (much, much leaner) HTML.
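A small sketch of the pandoc step, assuming pandoc is installed on the server and on the PATH (not a given on shared hosting) and that exec() is enabled. The function names are invented for this example; -f, -t and -o are pandoc's input format, output format and output file options.

```php
<?php
/**
 * Build the shell command that converts an HTML file to Markdown with pandoc.
 * Both paths are escaped so spaces and quotes in file names are safe.
 */
function pandocCommand(string $htmlPath, string $markdownPath): string
{
    return sprintf(
        'pandoc -f html -t markdown -o %s %s',
        escapeshellarg($markdownPath),
        escapeshellarg($htmlPath)
    );
}

/**
 * Run the conversion; returns true when pandoc exits with status 0.
 */
function convertHtmlToMarkdown(string $htmlPath, string $markdownPath): bool
{
    exec(pandocCommand($htmlPath, $markdownPath), $output, $exitCode);
    return $exitCode === 0;
}
```

Keeping the command builder separate makes it easy to log or inspect the exact command before anything is executed.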
Re: [PHP] storing searching docs
I think I'm good with text for the db and search capability and the pdf for pure display.

jg

On Dec 15, 2012, at 5:31 PM, tamouse mailing lists tamouse.li...@gmail.com wrote:

> I think my next email talked about sending the HTML through pandoc to make a plain text file, perhaps in markdown, which could be the thing you save, and then run it through a markdown filter to produce (a much, much leaner) HTML.
Re: [PHP] storing searching docs
Thanks for the input, gentlemen. Two opposing viewpoints!

I understand the concept of using files for the docs and a table to locate them and id them. But I am of the opinion that modern dbs are capable of handling very large objects (which these docs are NOT!) much more easily than years ago, so I am leaning that way still. It will certainly make my search process easier!

More comments anyone?
Re: [PHP] storing searching docs
On Thu, Dec 13, 2012 at 3:10 PM, Jim Giner jim.gi...@albanyhandball.com wrote:

> But I am of the opinion that modern dbs are capable of handling very large objects (of which these docs are NOT!) much easier than years ago, so I am leaning that way still. It will certainly make my search process easier! More comments anyone?

I'm not sure if there's much difference between large text fields and blobs, but I had a database (MySQL) with rows that had one blob each of 5-10 MB. At around 200-300 rows the database was pretty slow. After reaching about 2000 rows, it was terrible. Opening the database with phpMyAdmin (which just executes a SELECT with LIMIT 0, 30) took around 6 seconds. Doing an ORDER BY on one of the other columns took a few minutes. I tried both InnoDB and MyISAM for storage, but that didn't make much of a difference.

So it depends on how large your docs are, I guess.

- Matijn
Re: [PHP] storing searching docs
On 12/13/2012 9:19 AM, Matijn Woudt wrote:

> So it depends on how large your docs are I guess..

My docs are very small. Two-hour meetings, 4 typed pages usually, so approx. 8K of real data each. I don't think storage is much of a concern here. The actual .doc files are around 28K and when converted to RTF they grow to 44K - still not very large. Will this be a concern?
Re: [PHP] storing searching docs
On Thu, Dec 13, 2012 at 3:32 PM, Jim Giner jim.gi...@albanyhandball.com wrote:

> The actual doc formats are around 28K and when converted to RTF they grow to 44K - still not very large. Will this be a concern?

That of course also depends on how many you are planning on storing. I guess a few hundred will be OK, but after that I'm not so sure.

- Matijn
Re: [PHP] storing searching docs
Bastien Koert

On 2012-12-13, at 9:10 AM, Jim Giner jim.gi...@albanyhandball.com wrote:

> I understand the concept of using files for the docs and a table to locate them and id them. But I am of the opinion that modern dbs are capable of handling very large objects (of which these docs are NOT!) much easier than years ago, so I am leaning that way still.

I got away from storing blobs in the db. I noticed significant slowness after the db grew to about 12 GB in MySQL. Backups are also affected, as they take longer. This was an older MySQL, but it affected my MSSQL server the same way. Nowadays it's files into the file system and data into the db.

One thing you could consider is reading the contents of the files into a db field and storing just the text, to allow full-text search.

Bastien
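One possible shape for the "text in a db field, files on disk" idea, as a sketch only. The table and column names (minutes, meeting_date, pdf_name, body) and the function name are hypothetical; the SQL uses MySQL's FULLTEXT index and MATCH ... AGAINST syntax. MyISAM is named as the engine because InnoDB only gained FULLTEXT support in MySQL 5.6.

```php
<?php
// Hypothetical schema: the PDF stays on disk, only its name and the
// extracted text are stored. FULLTEXT on InnoDB needs MySQL 5.6+.
const MINUTES_SCHEMA = <<<'SQL'
CREATE TABLE minutes (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    meeting_date DATE NOT NULL,
    pdf_name     VARCHAR(255) NOT NULL,
    body         TEXT NOT NULL,
    FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM
SQL;

/**
 * Return the meetings whose text matches the user's search words.
 * The words are bound as a parameter, never concatenated into the SQL.
 */
function searchMinutes(PDO $db, string $words): array
{
    $stmt = $db->prepare(
        'SELECT meeting_date, pdf_name, body
           FROM minutes
          WHERE MATCH(body) AGAINST (:words IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([':words' => $words]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With a small MyISAM table, natural-language mode ignores words that appear in half or more of the rows, so a plain LIKE query may be the simpler fallback for a few hundred short documents.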
Re: [PHP] storing searching docs
On 12/13/2012 10:56 AM, Bastien wrote:

> One thing you could consider is reading the contents of the into a db field and just store the text to allow the full text search

A very clever idea! I like it - the best of both worlds. Can you sum up a method for getting the text out of the .doc (or .rtf) files so that I can automate the process for my past and future documents? Is there a single php function that would accomplish this?
Re: [PHP] storing searching docs
On Thu, Dec 13, 2012 at 5:13 PM, Jim Giner jim.gi...@albanyhandball.com wrote:

> Can you sum up a method for getting the text out of the .doc (or .rtf) files so that I can automate the process for my past and future documents? Is there a single php function that would accomplish this?

There's no builtin function for such stuff. doc files are quite tricky to parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite [1], which provides you an API for doing this.

- Matijn

[1] http://sourceforge.net/projects/phprtf/
Re: [PHP] storing searching docs
On Thu, 2012-12-13 at 18:41 +0100, Matijn Woudt wrote:

> There's no builtin function for such stuff. doc files are quite tricky to parse, but rtf files can be parsed pretty easily.

As well as rtf, the OpenDocument format is easy to read from PHP. Essentially it's just a bunch of XML files zipped up. Images are kept in the archive too, which is a handy way to retrieve thumbnails of docs as well!

Thanks,
Ash
http://www.ashleysheridan.co.uk
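To make the "zipped-up XML" point concrete, here is a minimal sketch for OpenDocument text files (.odt), where the document body is stored in content.xml inside the archive. The function name is invented; the code assumes PHP 7 or later with the zip extension (ZipArchive) and uses a simple regular expression in place of a real XML parser.

```php
<?php
/**
 * Extract the plain text of an OpenDocument text file (.odt).
 * Returns null when the archive or its content.xml cannot be read.
 */
function odtToText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    // The document body lives in content.xml inside the archive.
    $xml = $zip->getFromName('content.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }
    // End paragraphs and headings with a newline before dropping the markup.
    $xml = (string) preg_replace('#</text:(p|h)>#', "\n", $xml);
    // Remove all remaining tags, including the XML declaration.
    $xml = (string) preg_replace('/<[^>]+>/', '', $xml);
    // Decode the XML entities (&amp;, &lt;, ...) back to characters.
    return trim(html_entity_decode($xml, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}
```

This only helps if the minutes are saved as .odt in the first place; it does not read Word's binary .doc format.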
Re: [PHP] storing searching docs
On Thu, Dec 13, 2012 at 12:41 PM, Matijn Woudt tijn...@gmail.com wrote:

> There's no builtin function for such stuff. doc files are quite tricky to parse, but rtf files can be parsed pretty easily. One project is PHPRtfLite [1], which provides you an API for doing this.
>
> [1] http://sourceforge.net/projects/phprtf/

There is http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php which has some discussion on reading those files with Antiword (http://www.winfield.demon.nl/)

--
Bastien

Cat, the other other white meat
Re: [PHP] storing searching docs
On 12/13/2012 2:40 PM, Bastien Koert wrote:

> There is http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php which has some discussion on reading those files with Antiword (http://www.winfield.demon.nl/)

But I can't get antiword. I'm running Windows while my host is running Linux, and there aren't any Linux binaries available for download to put onto my host (assuming that I could do that!). Or am I missing something?
Re: [PHP] storing searching docs
Thanks for all the posts. After reading and googling all afternoon, I think the best approach for me is:

Create two macros in Word (done!) to export each of my .doc files to .txt and .pdf formats.

Create a sql table to hold the .txt contents of my .doc files, along with a reference to the meeting date and the name of the corresponding .pdf file.

Upload my two sets of files with an ftp client and then use a script to load the table with my .txt file data.

Now I just need a script to allow a user to locate a file and bring up the pdf when he wants to read about a meeting, and a second script to accept user input (search words), perform a query against the textual data and present some kind of results - probably a listing containing a reference to the meeting date and a tbd-length string showing the matching result for each occurrence, i.e., something like n chars in front of and after the match so the user can see the context of the match.

Sizes - a 28kb .doc file grows to 142kb in .pdf format and is only 5kb in .txt format. (Actually, if I 'print' the .doc as a pdf instead of using Word's File, Save As, the resulting pdf is only 70kb. Might need a new macro!)

Thanks again!
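A minimal sketch of the "n chars in front of and after the match" idea from the post above. The function name findSnippets and its parameters are made up for illustration, and it assumes the .txt contents are plain single-byte text (it counts bytes, not multi-byte characters). Matching is case-insensitive.

```php
<?php
/**
 * Return one snippet per occurrence of $needle in $text, with up to
 * $context characters kept on each side of the match.
 * Hypothetical helper: byte-based, so it assumes single-byte text.
 */
function findSnippets(string $text, string $needle, int $context = 40): array
{
    $snippets = [];
    if ($needle === '') {
        return $snippets; // nothing to search for
    }

    $needleLen = strlen($needle);
    $offset = 0;

    // stripos() is the case-insensitive variant of strpos()
    while (($pos = stripos($text, $needle, $offset)) !== false) {
        $start  = max(0, $pos - $context);               // do not run past the start
        $length = ($pos - $start) + $needleLen + $context;
        $snippets[] = substr($text, $start, $length);    // substr() clips at the end
        $offset = $pos + $needleLen;                     // continue after this match
    }

    return $snippets;
}
```

Each row returned by the search query could be passed through this function, and the resulting strings listed next to the meeting date.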
Re: [PHP] storing searching docs
On 12/13/2012 02:49 PM, Jim Giner wrote:
> Thanks for all the posts. After reading and googling all afternoon, I think the best approach for me is:
>
> Create two macros in Word (done!) to export each of my .doc files to .txt and .pdf formats.
>
> Create a sql table to hold the .txt contents of my .doc files, along with a reference to the meeting date and the name of the corresponding .pdf file.
>
> Upload my two sets of files with an ftp client and then use a script to load the table with my .txt file data.
>
> Now I just need a script to allow a user to locate a file and bring up the pdf when he wants to read about a meeting, and a second script to accept user input (search words), perform a query against the textual data and present some kind of results - probably a listing containing a reference to the meeting date and a tbd-length string showing the matching result for each occurrence, i.e., something like n chars in front of and after the match so the user can see the context of the match.
>
> Sizes - a 28kb .doc file grows to 142kb in .pdf format and is only 5kb in .txt format. (Actually, if I 'print' the .doc as a pdf instead of using Word's File, Save As, the resulting pdf is only 70kb. Might need a new macro!)
>
> Thanks again!

I wrote a script a few years ago that extracts the plain text out of a .doc file.

http://www.cmsws.com/examples/applications/word2_/convert.php

If you look in the directory you will see a few example files. You can view them like this:

.../convert.php?filename=test_building.doc

Replace test_building.doc with any of the other .doc files from the dir listing to see its contents.

I currently have it set to 64bit width rows. It shows you some nice pattern stuff with the MS Word format.

I have the source file viewable for the convert.php script as well.

http://www.cmsws.com/examples/applications/word2_/convert.phps

I have thought about extending this even further to figure out the layout and text formatting, but it hasn't gotten much attention for quite some time now.

Hope it helps.

-- 
Jim Lucas
http://www.cmsws.com/
http://www.cmsws.com/examples/
[PHP] storing searching docs
Slightly off-topic perhaps, but I'm looking for general input here.

New idea for a project - save the minutes of my firehouse meetings into a mysql table and build a ui to search them for words and such. The docs are written in Word currently. My simplistic idea is to perhaps convert them to something other than Word format and then to store them into a field of a mysql record with the meeting date as a key field. Of course, having them online, I should also allow for viewing as a document in something close to their original (?) format.

Any ideas - pro or con - on this idea?
Re: [PHP] storing searching docs
On Wed, Dec 12, 2012 at 01:00:41PM -0500, Jim Giner wrote:
> Slightly off-topic perhaps, but I'm looking for general input here.
>
> New idea for a project - save the minutes of my firehouse meetings into a mysql table and build a ui to search them for words and such. The docs are written in Word currently. My simplistic idea is to perhaps convert them to something other than Word format and then to store them into a field of a mysql record with the meeting date as a key field. Of course, having them online, I should also allow for viewing as a document in something close to their original (?) format.
>
> Any ideas - pro or con - on this idea?

First off, I'd convert them to RTF (rich text format). Word format is too ephemeral ( = self-incompatible). RTF is a lowest common denominator which can be converted to a variety of other formats. And RTF is a standardized format that both Word and things like Open Office understand. The formatting for meeting minutes doesn't dictate a very complicated layout (something that RTF isn't that good with). I would suggest HTML format, but Word is notoriously atrocious at faithfully converting its own formats into HTML. The result is horrid.

Second, you've hit on one of my pet peeves. Never never store huge blocks of text in SQL files. It slows them down and there's no real reason for it. There's no reason to force a DBMS to schlep around massive clumps of text or binary data. That's what disk file systems are for. Store the target data in a file and store a reference to the location of the data in the SQL database.

Or perhaps, use a NoSQL solution. I don't know much about the internals of nosql systems, but I would hope that the metadata about the text objects would be stored separately from the payload (text object).

Paul

-- 
Paul M. Foster
http://noferblatz.com
http://quillandmouse.com
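The "file on disk, reference in the database" layout suggested above could be sketched roughly as follows. Everything named here is an assumption for illustration: the function buildDocumentRow, the column names meeting_date and file_path, and a file naming convention of minutes-YYYY-MM-DD.pdf. The function only prepares the row; the document itself stays in the file system.

```php
<?php
/**
 * Build the row to store for one document: metadata and a path only,
 * never the file contents.
 * Assumes hypothetical file names of the form "minutes-YYYY-MM-DD.pdf".
 * Returns null when the name does not fit that convention.
 */
function buildDocumentRow(string $baseDir, string $fileName): ?array
{
    if (!preg_match('/^minutes-(\d{4})-(\d{2})-(\d{2})\.pdf$/', $fileName, $m)) {
        return null; // name does not follow the assumed convention
    }

    // checkdate() takes month, day, year and rejects e.g. 2012-13-40
    if (!checkdate((int) $m[2], (int) $m[3], (int) $m[1])) {
        return null;
    }

    return [
        'meeting_date' => sprintf('%s-%s-%s', $m[1], $m[2], $m[3]),
        'file_path'    => rtrim($baseDir, '/') . '/' . $fileName,
    ];
}
```

The returned array can be bound to an INSERT statement, so the table holds a few short columns per meeting regardless of how large the documents are.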
Re: [PHP] storing searching docs
On 12-12-2012 21:03, Paul M Foster wrote:
> Second, you've hit on one of my pet peeves. Never never store huge blocks of text in SQL files. It slows them down and there's no real reason for it. There's no reason to force a DBMS to schlep around massive clumps of text or binary data. That's what disk file systems are for. Store the target data in a file and store a reference to the location of the data in the SQL database.
>
> Or perhaps, use a NoSQL solution. I don't know much about the internals of nosql systems, but I would hope that the metadata about the text objects would be stored separately from the payload (text object).
>
> Paul

I actually disagree on this point. In the past, storing data in a database would make the entire database-system extremely slow and would eat up memory. These days, most database-systems can be (or even are) optimized to actually not do this anymore.

One positive aspect of storing such data in a database is the ability to search using full-text searches. For example, you could use the Sphinx Search Engine, which integrates into MySQL very well. It makes searching for specific words, phrases, etc. very simple and VERY fast.

So in this case, storing it in a database WOULD actually be a good idea IMO.

- Tul
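As a rough illustration of full-text searching over text kept in the database, here is a sketch that uses MySQL's built-in MATCH ... AGAINST rather than Sphinx, so it is a swapped-in technique and not the setup described above. The table name minutes, the columns meeting_date, pdf_name and body, and the function buildSearchQuery are all hypothetical, and the query only works if the body column has a FULLTEXT index.

```php
<?php
/**
 * Turn raw user input into a parameterised MySQL full-text query.
 * Hypothetical schema: table `minutes` with columns meeting_date,
 * pdf_name and body (body needs a FULLTEXT index).
 * Returns null when the input holds no search words.
 */
function buildSearchQuery(string $userInput): ?array
{
    // Collapse runs of whitespace so "  ladder   truck " becomes "ladder truck"
    $terms = trim((string) preg_replace('/\s+/', ' ', $userInput));
    if ($terms === '') {
        return null;
    }

    $sql = 'SELECT meeting_date, pdf_name FROM minutes '
         . 'WHERE MATCH(body) AGAINST (:terms IN NATURAL LANGUAGE MODE) '
         . 'ORDER BY meeting_date DESC';

    // The search words travel as a bound parameter, never inside the SQL text
    return ['sql' => $sql, 'params' => [':terms' => $terms]];
}
```

With an existing PDO connection (assumed here as $pdo), the result would be run with $pdo->prepare($q['sql']) followed by execute($q['params']).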
Re: [PHP] storing searching docs
On 12-12-2012 21:40, Maciek Sokolewicz wrote:
> On 12-12-2012 21:03, Paul M Foster wrote:
>> Second, you've hit on one of my pet peeves. Never never store huge blocks of text in SQL files. It slows them down and there's no real reason for it. There's no reason to force a DBMS to schlep around massive clumps of text or binary data. That's what disk file systems are for. Store the target data in a file and store a reference to the location of the data in the SQL database.
>>
>> Or perhaps, use a NoSQL solution. I don't know much about the internals of nosql systems, but I would hope that the metadata about the text objects would be stored separately from the payload (text object).
>>
>> Paul
>
> I actually disagree on this point. In the past, storing data in a database would make the entire database-system extremely slow and would eat up memory. These days, most database-systems can be (or even are) optimized to actually not do this anymore.
>
> One positive aspect of storing such data in a database is the ability to search using full-text searches. For example, you could use the Sphinx Search Engine, which integrates into MySQL very well. It makes searching for specific words, phrases, etc. very simple and VERY fast.
>
> So in this case, storing it in a database WOULD actually be a good idea IMO.
>
> - Tul

Actually, I have to come back on that one. You could also store it locally in files, and feed it into the searchd daemon manually.