Re: MySQL For Huge Collections
Hi All,

In this case, how will the images of a book be stored? A chapter may contain a number of images of different sizes. Or does the design deal only with text?

Thanks,
Vikram A

From: Jerry Schwartz je...@gii.co.jp
To: Andy listan...@gmail.com; mysql@lists.mysql.com
Sent: Fri, 11 June, 2010 9:05:26 PM
Subject: RE: MySQL For Huge Collections

[Jerry Schwartz's reply of 11 June was quoted in full here; it appears as its own message later in this thread. Quote trimmed.]

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=vikkiatb...@yahoo.in
RE: MySQL For Huge Collections
From: Vikram A [mailto:vikkiatb...@yahoo.in]
Sent: Wednesday, June 16, 2010 2:58 AM
To: je...@gii.co.jp; Andy; mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

> Hi All,
>
> In this case, how will the images of a book be stored? A chapter may
> contain a number of images of different sizes. Or does it deal only
> with text?

[JS] I was only thinking about text, but you can extend the idea to handle images by adding another table. Let's assume that you want to associate each image with a line. Just add a table with a BLOB field in each record, put the image in the BLOB, and link it to the nearest line. A line record could then link to any number of images, from zero on up.

The image table would need a lot more information than just the BLOB and the line number, of course. You'd need all kinds of page-layout information for presentation purposes: whether the image is inline with the text, on the left, on the right, in the middle, below the text, and so on.

This is getting very complicated. If you're going to have images, then you can't be starting with plain text. Depending upon the format of the original data, you might consider storing everything as HTML. That would make it somewhat more complicated to detect line boundaries, but it would preserve the layout for eventual presentation. You've just complicated the whole process enormously.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
www.the-infoshop.com

> Thanks,
> Vikram A

[remainder of quoted thread trimmed; the earlier messages appear in full later in this thread]
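Jerry's image-per-line idea above can be sketched as a single extra table. This is only an illustrative sketch: the table name, column names, and the placement codes are assumptions, not anything specified in the thread.

```sql
-- Hypothetical image table, one row per stored image.
-- line_id points at the nearest line of text (per the line table
-- suggested elsewhere in this thread).
CREATE TABLE image (
    image_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    line_id    INT UNSIGNED NOT NULL,   -- nearest line of text
    img_data   LONGBLOB     NOT NULL,   -- the image bytes themselves
    mime_type  VARCHAR(64)  NOT NULL,   -- e.g. 'image/png'
    placement  ENUM('inline','left','right','center','below') NOT NULL,
    width_px   SMALLINT UNSIGNED NULL,
    height_px  SMALLINT UNSIGNED NULL,
    INDEX (line_id)
) ENGINE=InnoDB;
```

Because many image rows can carry the same line_id, the zero-to-many relationship Jerry describes falls out of the plain index on line_id; nothing special is needed.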
Re: MySQL For Huge Collections
Hello all,

Thanks much for your replies. OK, so I realize that I may not have explained the problem clearly enough. I will try to do it now.

I am a researcher in computational linguistics, and I am trying to study language usage and writing styles across different genres of books over the years. The system I am developing is not just to serve up e-book content (that may happen later) but to help me analyze, at the micro level, the different constituent elements of a book (say at the chapter or paragraph level). As part of this work, I need to break up, store, and repeatedly run queries across multiple e-books. Here are several additional sample queries:

* give me books that use the word ABC
* give me the first 10 pages of e-book XYZ
* give me chapter 1 of all e-books

Definitely, at a later stage, when I start making my research available to the community, I will need to be able to provide full-text (or chapter-wise) search to the users, among other things.

Please let me know if you have additional comments.

Andy

On Thu, Jun 10, 2010 at 9:05 PM, Peter Chacko peterchack...@gmail.com wrote:

[Peter Chacko's and Shawn Green's messages were quoted in full here; both appear as their own messages later in this thread. Quote trimmed.]
RE: MySQL For Huge Collections
-----Original Message-----
From: Andy [mailto:listan...@gmail.com]
Sent: Friday, June 11, 2010 8:09 AM
To: mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

> As part of this work, I need to break up, store, and repeatedly run
> queries across multiple e-books. Here are several additional sample
> queries:
>
> * give me books that use the word ABC
> * give me the first 10 pages of e-book XYZ
> * give me chapter 1 of all e-books

[rest of Andy's message trimmed; it appears in full elsewhere in this thread]

[JS] You pose an interesting challenge. Normally, my choice is to store big things as normal files and maintain the index (with accompanying descriptive information) in the database. You've probably seen systems like this, where you assign tags to pictures. That would certainly handle the second two cases (with some ancillary programming, of course).

Your first example is a bigger challenge. MySQL can do full-text searches, but from what I've read they can get painfully slow. I've never encountered that problem myself, but my databases are rather small (~10 rows). For this technique, you would want to store all of your text in LONGTEXT columns. I've also read that there are plug-ins that do the same thing, only faster.

I'm not sure how you would define a "page" of an e-book, and I suspect you would also want to deal with individual paragraphs or lines. My suggestion would be to have a book table, with such things as the title, author, and perhaps ISBN; a page table identifying which paragraphs are on which page (for a given book); a paragraph table identifying which lines are in which paragraph; and then a line table that contains the actual text of each line:

    [book1, title, ...] -> [book1, para1] -> [para1, line1, linetext]
    [book2, title, ...]    [book1, para2]    [para1, line2, linetext]
    [book3, title, ...]    [book1, para3]    [para1, line3, linetext]
    ...                    [book1, para4]    [para1, line4, linetext]
                           ...               [para1, line5, linetext]
                                             ...

This would let you have a full-text index on the titles, and another on the linetext, with a number of ways to limit your searches. Because the linetext field would be relatively short, the search should be relatively fast, even though a relatively large number of records might be returned if you wanted to search entire books.

NOTE: Small test cases can yield surprising results because of the way full-text searches determine relevancy! This has bitten me more than once.

This was fun; I hope my suggestions make sense.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
www.the-infoshop.com
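The four-table layout sketched above could be expressed as DDL along the following lines. All names are illustrative assumptions, and the FULLTEXT indexes assume MyISAM tables, since the MySQL versions current at the time (before 5.6) did not support full-text indexes on InnoDB.

```sql
-- Hypothetical schema for the book / page / paragraph / line layout.
CREATE TABLE book (
    book_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title    VARCHAR(255) NOT NULL,
    author   VARCHAR(255),
    isbn     CHAR(13),
    FULLTEXT (title)
) ENGINE=MyISAM;

-- Which paragraphs appear on which page of which book.
CREATE TABLE page (
    book_id  INT UNSIGNED NOT NULL,
    page_no  INT UNSIGNED NOT NULL,
    para_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (book_id, page_no, para_id)
) ENGINE=MyISAM;

CREATE TABLE paragraph (
    para_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    book_id  INT UNSIGNED NOT NULL,
    INDEX (book_id)
) ENGINE=MyISAM;

CREATE TABLE line (
    line_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    para_id   INT UNSIGNED NOT NULL,
    linetext  VARCHAR(1024) NOT NULL,
    INDEX (para_id),
    FULLTEXT (linetext)
) ENGINE=MyISAM;

-- "Give me books that use the word ABC":
SELECT DISTINCT b.book_id, b.title
FROM   book b
JOIN   paragraph p ON p.book_id = b.book_id
JOIN   line l      ON l.para_id = p.para_id
WHERE  MATCH(l.linetext) AGAINST ('ABC');
```

Keeping the full-text index on short line records, rather than on whole-book LONGTEXT columns, is exactly the trade-off described above: more rows returned, but each match is cheap.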
RE: MySQL For Huge Collections
Agreed. Consider keeping metadata about each book in your MySQL database, but storing and serving the actual files from somewhere else. If I were you, I'd use an external full-text search engine such as Sphinx or Lucene to handle searching for content inside the books.

Also, in terms of requirements, 300K books doesn't say a lot. Looking at Project Gutenberg, I see that an uncompressed text copy of Sherlock Holmes is only about 500 KB, so you're talking about maybe 150 GB of data, which is pretty moderate.

Sounds like a fun project though. Good luck!

Regards,
Gavin Towey

-----Original Message-----
From: Peter Chacko [mailto:peterchack...@gmail.com]
Sent: Thursday, June 10, 2010 9:05 PM
To: SHAWN L.GREEN
Cc: Andy; mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

[Peter Chacko's and Shawn Green's messages were quoted in full here; both appear as their own messages later in this thread. Quote trimmed.]
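Gavin's "metadata in MySQL, files elsewhere" split might look like the following. The table and column names are assumptions for illustration, and the example path is hypothetical.

```sql
-- Hypothetical catalogue table: MySQL holds only descriptive data,
-- while the e-book text itself lives on the filesystem or a NAS.
CREATE TABLE ebook (
    ebook_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    author     VARCHAR(255),
    genre      VARCHAR(64),
    pub_year   SMALLINT UNSIGNED,
    file_path  VARCHAR(1024) NOT NULL,   -- e.g. '/nas/ebooks/12345.txt'
    INDEX (genre, pub_year)
) ENGINE=InnoDB;
```

An external engine such as Sphinx or Lucene would then index the files (or the extracted chapters) and return matching ebook_id values, which a plain indexed MySQL lookup turns into titles and file paths.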
MySQL For Huge Collections
Hello all,

I am new to MySQL and am exploring the possibility of using it for my work. I have about 300,000 e-books, each about 100 pages long. I am first going to extract each chapter from each e-book and then basically store an e-book as a collection of chapters. A chapter could of course be arbitrarily long, depending on the book.

My questions are:

(1) Can MySQL handle data of this size?
(2) How can I store the text (contents) of each chapter? What data type will be appropriate? LONGTEXT?
(3) I only envision running queries to extract a specific chapter from a specific e-book (say, extract the chapter titled ABC from e-book number XYZ, or e-book titled XYZ). Can MySQL handle these types of queries well on data of this size?
(4) What are the benefits/drawbacks of using MySQL compared to using XML databases?

I look forward to help on this topic. Many thanks in advance.

Andy
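For questions (2) and (3), a minimal sketch, assuming one row per chapter with the chapter body in a LONGTEXT column (table and column names, and the example e-book number, are illustrative):

```sql
-- One row per chapter; LONGTEXT holds up to 4 GB, far beyond any chapter.
CREATE TABLE chapter (
    ebook_id      INT UNSIGNED NOT NULL,
    chapter_no    INT UNSIGNED NOT NULL,
    chapter_title VARCHAR(255) NOT NULL,
    body          LONGTEXT     NOT NULL,
    PRIMARY KEY (ebook_id, chapter_no),
    INDEX (chapter_title)
) ENGINE=InnoDB;

-- "Extract the chapter titled ABC from e-book number XYZ":
SELECT body
FROM   chapter
WHERE  ebook_id = 12345 AND chapter_title = 'ABC';
```

A point lookup like this uses the indexes and stays fast no matter how many books are stored; it is the "which books use word W" style of query, discussed elsewhere in the thread, that needs more care.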
Re: MySQL For Huge Collections
On 6/10/2010 10:16 PM, Andy wrote:

> Hello all,
>
> I am new to MySQL and am exploring the possibility of using it for my
> work. I have about 300,000 e-books, each about 100 pages long.

[rest of Andy's message trimmed; it appears in full elsewhere in this thread]

Always pick the right tool for the job. MySQL may not be the best tool for serving up e-book contents; however, if you want to index and locate contents based on various parameters, then it may be a good fit for the purpose.

Your simple queries would best be handled by a basic web server or FTP server, because you seem to want http://your.site.here/ABC/xyz, where ABC is your book and xyz is your chapter. Those types of technology are VERY well suited to managing the repetitive streaming and distribution of large binary objects (chapter files), like you might encounter with an e-book content delivery system.

--
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc.
Office: Blountville, TN
Re: MySQL For Huge Collections
Usually, you would be better off using a NAS for such a purpose. A database is designed for highly transactional, record-oriented storage that needs fast access. In the long run, you could look at an enterprise content management system that rests its storage on a scalable NAS, with file virtualization.

thanks

On Fri, Jun 11, 2010 at 8:04 AM, SHAWN L.GREEN shawn.l.gr...@oracle.com wrote:

[Shawn Green's message was quoted in full here; it appears as its own message in this thread. Quote trimmed.]