I need to save text files (let say between 1 KB to 20 MB) in a SQLite DB.

I see two possibilities:

1) save all the content in a single column:

create table content(id integer not null primary key,
                      text blob not null);

2) split the content in lines:

create table line(content integer not null,
                   line integer not null,
                   text blob not null,
                   primary key(content, line));

Some queries will need to extract the whole file, while other queries will need 
to extract the text for a range of lines.

According to your experience it is better/faster the first option, the second 
option or a smarter option I've not considered?

Why not do both?

If it was me, I would write some code to split the text into sentences (not lines - which is rather easy in English, but might be harder in some other languages).

Then I would save the lines in a two-table Database like this:

CREATE TABLE files(
  ID INTEGER PRIMARY KEY, -- This is only to get a handle on the rowid
  fileName TEXT,  -- Add COLLATE NOCASE if the file system is case insensitive, like Windows.
  filePath TEXT,
  ...  -- Any other things you want to store about the origin file, or perhaps a timestamp etc.
);

CREATE TABLE content(
  ID INTEGER PRIMARY KEY,
  fileID INT NOT NULL REFERENCES files(ID), -- This to know which origin file [*].   lineNo INT NOT NULL,  -- So that a line can be referenced by position into a file.
  txtLine TEXT,
  ...
);

Now you can easily query all lines with a specific fileID to see the entire document, or JOIN by fileName even to list the content of any file by name, or simply refer to any specific line in any file by either its fileID+lineNo or simply its own ID.  I would probably go further (since I'm code-parsing the file anyway) and include a paragraph number or even chapter + page numbers if that's relevant.

This approach will work whether you split lines into sentences or just physical positional lines, though a sentence-split makes more sense to me (unless it's a data format).

[*] - I'm only showing basic options, but you'd typically want to add some ON UPDATE CASCADE or ON DELETE referencing methods.


Cheers,
Ryan

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to