Re: MySQL For Huge Collections

2010-06-16 Thread Vikram A
Hi All,

In this case, how would the images of a book be stored? A chapter may
contain a number of images of different sizes.

Or does it deal only with text?


Thanks.

Vikram A



From: Jerry Schwartz je...@gii.co.jp
To: Andy listan...@gmail.com; mysql@lists.mysql.com
Sent: Fri, 11 June, 2010 9:05:26 PM
Subject: RE: MySQL For Huge Collections

-Original Message-
From: Andy [mailto:listan...@gmail.com]
Sent: Friday, June 11, 2010 8:09 AM
To: mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

Hello all,

Thanks much for your replies.

OK, so I realized that I may not have explained the problem clearly enough.
I will try to do it now.

I am a researcher in computational linguistics, and I am studying
language usage and writing styles across different genres of books over
the years. The system I am developing is not just to serve up e-book
content (that will possibly happen later) but to help me analyze the
different constituent elements of a book at the micro level (say, at the
chapter or paragraph level). As part of this work, I need to break up,
store, and repeatedly run queries across multiple e-books. Here are
several additional sample queries:

* give me books that use the word ABC
* give me the first 10 pages of e-book XYZ
* give me chapter 1 of all e-books

[JS] You pose an interesting challenge. Normally, my choice is to store big 
things as normal files and maintain the index (with accompanying descriptive 
information) in the database. You've probably seen systems like this, where 
you assign tags to pictures. That would certainly handle the last two 
cases (with some ancillary programming, of course).

Your first example is a bigger challenge. MySQL can do full text searches, but 
from what I've read they can get painfully slow. I never encountered that 
problem, but my databases are rather small (~10 rows). For this technique, 
you would want to store all of your text in LONGTEXT columns.
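
As a rough sketch of that technique (table and column names here are
hypothetical, not anything from the original posts), a chapter-per-row
layout with a LONGTEXT column might look like this. Note that in the
MySQL versions of that era (5.1 and earlier), FULLTEXT indexes were only
available on the MyISAM engine:

```sql
-- Hypothetical sketch: one row per chapter, full chapter text in LONGTEXT.
CREATE TABLE chapter (
    chapter_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    book_id    INT UNSIGNED NOT NULL,
    title      VARCHAR(255),
    body       LONGTEXT NOT NULL,
    FULLTEXT KEY ft_body (body)   -- requires MyISAM in MySQL 5.1 and earlier
) ENGINE=MyISAM;

-- "Give me books that use the word ABC"
SELECT DISTINCT book_id
FROM chapter
WHERE MATCH(body) AGAINST('ABC');
```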

I've also read that there are plug-ins that do the same thing, only faster.

I'm not sure how you would define a page of an e-book, and I suspect you 
would also deal with individual paragraphs or lines. My suggestion for that 
would be to have a book table, with such things as the title and author and 
perhaps ISBN; a page table identifying which paragraphs are on which page 
(for a given book); a paragraph table identifying which lines are in which 
paragraph; and then a lines table that contains the actual text of each 
line.

[book1, title, ...] -> [book1, para1] -> [para1, line1, linetext]
[book2, title, ...]    [book1, para2]    [para1, line2, linetext]
[book3, title, ...]    [book1, para3]    [para1, line3, linetext]
...                    [book1, para4]    [para1, line4, linetext]
                       ...               [para1, line5, linetext]
                                         ...

This would let you have a full text index on the titles, and another on the 
linetext, with a number of ways to limit your searches. Because the linetext 
field would be relatively short, the search should be relatively fast even 
though there might be a relatively large number of records returned if you 
wanted to search entire books.
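
A minimal sketch of that four-table layout (all names are hypothetical,
and MyISAM is assumed because FULLTEXT indexes required it at the time)
might be:

```sql
CREATE TABLE book (
    book_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255) NOT NULL,
    author  VARCHAR(255),
    isbn    CHAR(13),
    FULLTEXT KEY ft_title (title)
) ENGINE=MyISAM;

CREATE TABLE page (
    book_id INT UNSIGNED NOT NULL,
    page_no INT UNSIGNED NOT NULL,
    para_id INT UNSIGNED NOT NULL,   -- a paragraph appearing on this page
    PRIMARY KEY (book_id, page_no, para_id)
) ENGINE=MyISAM;

CREATE TABLE paragraph (
    para_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    book_id INT UNSIGNED NOT NULL
) ENGINE=MyISAM;

CREATE TABLE line (
    line_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    para_id  INT UNSIGNED NOT NULL,
    line_no  INT UNSIGNED NOT NULL,
    linetext TEXT NOT NULL,
    FULLTEXT KEY ft_linetext (linetext)
) ENGINE=MyISAM;

-- "Give me the first 10 pages of e-book XYZ"
SELECT l.linetext
FROM book b
JOIN page pg ON pg.book_id = b.book_id
JOIN line l  ON l.para_id  = pg.para_id
WHERE b.title = 'XYZ'
  AND pg.page_no <= 10
ORDER BY pg.page_no, l.line_no;
```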

NOTE: Small test cases might yield surprising results because of the way full 
text searches determine relevancy! This has bitten me more than once.

This was fun, I hope my suggestions make sense.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=vikkiatb...@yahoo.in



RE: MySQL For Huge Collections

2010-06-16 Thread Jerry Schwartz
From: Vikram A [mailto:vikkiatb...@yahoo.in] 
Sent: Wednesday, June 16, 2010 2:58 AM
To: je...@gii.co.jp; Andy; mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

 

Hi All,

In this case, how would the images of a book be stored? A chapter may
contain a number of images of different sizes.

Or does it deal only with text?

[JS] I was only thinking about text, but you can extend the idea to handle 
images by adding another table. Let’s assume that you want to associate each 
image with a line. Just add a table with a blob field in each record. Put the 
image in the blob and link it to the nearest line. A line record could link to 
any number of images, from zero to infinity.

The image table would have to have a lot more information than just the blob 
and the line number, of course. You’d need all kinds of page layout information 
for presentation purposes: is the image in line with the text, on the left, on 
the right, in the middle, below the text, etc. This is getting very complicated.
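
For example, a hypothetical image table along those lines (column and
table names are illustrative only, not part of the original suggestion):

```sql
-- Hypothetical: one row per image, linked to the nearest line of text.
CREATE TABLE image (
    image_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    line_id   INT UNSIGNED NOT NULL,   -- nearest line in the lines table
    placement ENUM('inline','left','right','center','below'),
    mime_type VARCHAR(64),
    img_data  LONGBLOB NOT NULL,       -- the image bytes themselves
    KEY idx_line (line_id)
) ENGINE=MyISAM;
```

A line can then have zero or more associated images, retrieved with a
simple join on line_id.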

If you’re going to have images, then you can’t be starting with plain text. 
Depending upon the format of the original data, you might consider storing 
everything as HTML. That would make it somewhat more complicated to detect line 
boundaries, but it would preserve the layout for eventual presentation. You’ve 
just complicated the whole process enormously.

 

Regards,

 

Jerry Schwartz

Global Information Incorporated

195 Farmington Ave.

Farmington, CT 06032

 

860.674.8796 / FAX: 860.674.8341

 

www.the-infoshop.com

 

 


Re: MySQL For Huge Collections

2010-06-11 Thread Andy
Hello all,

Thanks much for your replies.

OK, so I realized that I may not have explained the problem clearly enough.
I will try to do it now.

I am a researcher in computational linguistics, and I am studying
language usage and writing styles across different genres of books over
the years. The system I am developing is not just to serve up e-book
content (that will possibly happen later) but to help me analyze the
different constituent elements of a book at the micro level (say, at the
chapter or paragraph level). As part of this work, I need to break up,
store, and repeatedly run queries across multiple e-books. Here are
several additional sample queries:

* give me books that use the word ABC
* give me the first 10 pages of e-book XYZ
* give me chapter 1 of all e-books

Definitely, at a later stage when I start making my research available to
the community, I will need to be able to provide fulltext (or chapter-wise)
search also to the users, among other things.

Please let me know if you have additional comments.

Andy



On Thu, Jun 10, 2010 at 9:05 PM, Peter Chacko peterchack...@gmail.com wrote:

 Usually, you are better off using a NAS for this purpose. A database is
 designed to store highly transactional, record-oriented data that needs
 fast access. You could look at any enterprise content management system
 that rests its storage on a scalable NAS, with file virtualization in
 the long run.

 thanks

 On Fri, Jun 11, 2010 at 8:04 AM, SHAWN L.GREEN shawn.l.gr...@oracle.com
 wrote:
  On 6/10/2010 10:16 PM, Andy wrote:
 
  Hello all,
 
  I am new to MySQL and am exploring the possibility of using it for my
  work.
  I have about ~300,000 e-books, each about 100 pages long. I am first
 going
  to extract each chapter from each e-book and then basically store an
  e-book
  as a collection of chapters. A chapter could of course be arbitrarily
 long
  depending on the book.
 
  My questions are:
 
  (1) Can MySQL handle data of this size?
  (2) How can I store text (contents) of each chapter? What data type will
  be
  appropriate? longtext?
  (3) I only envision running queries to extract a specific chapter from a
  specific e-book (say extract the chapter titled ABC from e-book number
  XYZ
  (or e-book titled XYZ)). Can MySQL handle these types of queries well
 on
  data of this size?
  (4) What are the benefits/drawbacks of using MySQL compared to using XML
  databases?
 
  I look forward to help on this topic. Many thanks in advance.
  Andy
 
 
  Always pick the right tool for the job.
 
  MySQL may not be the best tool for serving up eBook contents. However if
 you
  want to index and locate contents based on various parameters, then it
 may
  be a good fit for the purpose.
 
  Your simple queries would best be handled by a basic web server or FTP
  server because you seem to want
 
  http://your.site.here/ABC/xyz
 
  where ABC is your book and xyz is your chapter.
 
  Those types of technology are VERY well suited for managing the
 repetitive
  streaming and distribution of large binary objects (chapter files) like
 you
  might encounter with an eBook content delivery system.
 
  --
  Shawn Green
  MySQL Principal Technical Support Engineer
  Oracle USA, Inc.
  Office: Blountville, TN
 
 
 



RE: MySQL For Huge Collections

2010-06-11 Thread Jerry Schwartz
-Original Message-
From: Andy [mailto:listan...@gmail.com]
Sent: Friday, June 11, 2010 8:09 AM
To: mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

Hello all,

Thanks much for your replies.

OK, so I realized that I may not have explained the problem clearly enough.
I will try to do it now.

I am a researcher in computational linguistics, and I am studying
language usage and writing styles across different genres of books over
the years. The system I am developing is not just to serve up e-book
content (that will possibly happen later) but to help me analyze the
different constituent elements of a book at the micro level (say, at the
chapter or paragraph level). As part of this work, I need to break up,
store, and repeatedly run queries across multiple e-books. Here are
several additional sample queries:

* give me books that use the word ABC
* give me the first 10 pages of e-book XYZ
* give me chapter 1 of all e-books

[JS] You pose an interesting challenge. Normally, my choice is to store big 
things as normal files and maintain the index (with accompanying descriptive 
information) in the database. You've probably seen systems like this, where 
you assign tags to pictures. That would certainly handle the last two 
cases (with some ancillary programming, of course).

Your first example is a bigger challenge. MySQL can do full text searches, but 
from what I've read they can get painfully slow. I never encountered that 
problem, but my databases are rather small (~10 rows). For this technique, 
you would want to store all of your text in LONGTEXT columns.

I've also read that there are plug-ins that do the same thing, only faster.

I'm not sure how you would define a page of an e-book, and I suspect you 
would also deal with individual paragraphs or lines. My suggestion for that 
would be to have a book table, with such things as the title and author and 
perhaps ISBN; a page table identifying which paragraphs are on which page 
(for a given book); a paragraph table identifying which lines are in which 
paragraph; and then a lines table that contains the actual text of each 
line.

[book1, title, ...] -> [book1, para1] -> [para1, line1, linetext]
[book2, title, ...]    [book1, para2]    [para1, line2, linetext]
[book3, title, ...]    [book1, para3]    [para1, line3, linetext]
...                    [book1, para4]    [para1, line4, linetext]
                       ...               [para1, line5, linetext]
                                         ...

This would let you have a full text index on the titles, and another on the 
linetext, with a number of ways to limit your searches. Because the linetext 
field would be relatively short, the search should be relatively fast even 
though there might be a relatively large number of records returned if you 
wanted to search entire books.

NOTE: Small test cases might yield surprising results because of the way full 
text searches determine relevancy! This has bitten me more than once.

This was fun, I hope my suggestions make sense.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com







RE: MySQL For Huge Collections

2010-06-11 Thread Gavin Towey
Agreed.

Consider keeping meta data about the book in your mysql database, but storing 
and serving the actual files from somewhere else.

If I were you, I'd use an external full text search engine like Sphinx or 
Lucene to handle something like searching for content inside the book.

Also, in terms of requirements, 300k books isn't actually that much data. 
Looking at Project Gutenberg, I see that an uncompressed text copy of Sherlock 
Holmes is only about 500 KB, so you're talking about maybe 150 GB of data, 
which is pretty moderate.

Sounds like a fun project though, good luck!

Regards,
Gavin Towey


-Original Message-
From: Peter Chacko [mailto:peterchack...@gmail.com]
Sent: Thursday, June 10, 2010 9:05 PM
To: SHAWN L.GREEN
Cc: Andy; mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

Usually, you are better off using a NAS for this purpose. A database is
designed to store highly transactional, record-oriented data that needs
fast access. You could look at any enterprise content management system
that rests its storage on a scalable NAS, with file virtualization in
the long run.

thanks

On Fri, Jun 11, 2010 at 8:04 AM, SHAWN L.GREEN shawn.l.gr...@oracle.com wrote:
 On 6/10/2010 10:16 PM, Andy wrote:

 Hello all,

 I am new to MySQL and am exploring the possibility of using it for my
 work.
 I have about ~300,000 e-books, each about 100 pages long. I am first going
 to extract each chapter from each e-book and then basically store an
 e-book
 as a collection of chapters. A chapter could of course be arbitrarily long
 depending on the book.

 My questions are:

 (1) Can MySQL handle data of this size?
 (2) How can I store text (contents) of each chapter? What data type will
 be
 appropriate? longtext?
 (3) I only envision running queries to extract a specific chapter from a
 specific e-book (say extract the chapter titled ABC from e-book number
 XYZ
 (or e-book titled XYZ)). Can MySQL handle these types of queries well on
 data of this size?
 (4) What are the benefits/drawbacks of using MySQL compared to using XML
 databases?

 I look forward to help on this topic. Many thanks in advance.
 Andy


 Always pick the right tool for the job.

 MySQL may not be the best tool for serving up eBook contents. However if you
 want to index and locate contents based on various parameters, then it may
 be a good fit for the purpose.

 Your simple queries would best be handled by a basic web server or FTP
 server because you seem to want

 http://your.site.here/ABC/xyz

 where ABC is your book and xyz is your chapter.

 Those types of technology are VERY well suited for managing the repetitive
 streaming and distribution of large binary objects (chapter files) like you
 might encounter with an eBook content delivery system.

 --
 Shawn Green
 MySQL Principal Technical Support Engineer
 Oracle USA, Inc.
 Office: Blountville, TN







MySQL For Huge Collections

2010-06-10 Thread Andy
Hello all,

I am new to MySQL and am exploring the possibility of using it for my work.
I have about ~300,000 e-books, each about 100 pages long. I am first going
to extract each chapter from each e-book and then basically store an e-book
as a collection of chapters. A chapter could of course be arbitrarily long
depending on the book.

My questions are:

(1) Can MySQL handle data of this size?
(2) How can I store text (contents) of each chapter? What data type will be
appropriate? longtext?
(3) I only envision running queries to extract a specific chapter from a
specific e-book (say extract the chapter titled ABC from e-book number XYZ
(or e-book titled XYZ)). Can MySQL handle these types of queries well on
data of this size?
(4) What are the benefits/drawbacks of using MySQL compared to using XML
databases?

I look forward to help on this topic. Many thanks in advance.
Andy


Re: MySQL For Huge Collections

2010-06-10 Thread SHAWN L.GREEN

On 6/10/2010 10:16 PM, Andy wrote:

Hello all,

I am new to MySQL and am exploring the possibility of using it for my work.
I have about ~300,000 e-books, each about 100 pages long. I am first going
to extract each chapter from each e-book and then basically store an e-book
as a collection of chapters. A chapter could of course be arbitrarily long
depending on the book.

My questions are:

(1) Can MySQL handle data of this size?
(2) How can I store text (contents) of each chapter? What data type will be
appropriate? longtext?
(3) I only envision running queries to extract a specific chapter from a
specific e-book (say extract the chapter titled ABC from e-book number XYZ
(or e-book titled XYZ)). Can MySQL handle these types of queries well on
data of this size?
(4) What are the benefits/drawbacks of using MySQL compared to using XML
databases?

I look forward to help on this topic. Many thanks in advance.
Andy



Always pick the right tool for the job.

MySQL may not be the best tool for serving up eBook contents. However if 
you want to index and locate contents based on various parameters, then 
it may be a good fit for the purpose.


Your simple queries would best be handled by a basic web server or FTP 
server because you seem to want


http://your.site.here/ABC/xyz

where ABC is your book and xyz is your chapter.

Those types of technology are VERY well suited for managing the 
repetitive streaming and distribution of large binary objects (chapter 
files) like you might encounter with an eBook content delivery system.


--
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc.
Office: Blountville, TN




Re: MySQL For Huge Collections

2010-06-10 Thread Peter Chacko
Usually, you are better off using a NAS for this purpose. A database is
designed to store highly transactional, record-oriented data that needs
fast access. You could look at any enterprise content management system
that rests its storage on a scalable NAS, with file virtualization in
the long run.

thanks

On Fri, Jun 11, 2010 at 8:04 AM, SHAWN L.GREEN shawn.l.gr...@oracle.com wrote:
 On 6/10/2010 10:16 PM, Andy wrote:

 Hello all,

 I am new to MySQL and am exploring the possibility of using it for my
 work.
 I have about ~300,000 e-books, each about 100 pages long. I am first going
 to extract each chapter from each e-book and then basically store an
 e-book
 as a collection of chapters. A chapter could of course be arbitrarily long
 depending on the book.

 My questions are:

 (1) Can MySQL handle data of this size?
 (2) How can I store text (contents) of each chapter? What data type will
 be
 appropriate? longtext?
 (3) I only envision running queries to extract a specific chapter from a
 specific e-book (say extract the chapter titled ABC from e-book number
 XYZ
 (or e-book titled XYZ)). Can MySQL handle these types of queries well on
 data of this size?
 (4) What are the benefits/drawbacks of using MySQL compared to using XML
 databases?

 I look forward to help on this topic. Many thanks in advance.
 Andy


 Always pick the right tool for the job.

 MySQL may not be the best tool for serving up eBook contents. However if you
 want to index and locate contents based on various parameters, then it may
 be a good fit for the purpose.

 Your simple queries would best be handled by a basic web server or FTP
 server because you seem to want

 http://your.site.here/ABC/xyz

 where ABC is your book and xyz is your chapter.

 Those types of technology are VERY well suited for managing the repetitive
 streaming and distribution of large binary objects (chapter files) like you
 might encounter with an eBook content delivery system.

 --
 Shawn Green
 MySQL Principal Technical Support Engineer
 Oracle USA, Inc.
 Office: Blountville, TN



