Re: [sqlite] [EXTERNAL] Save text file content in db: lines or whole file?

R Smith Mon, 06 Aug 2018 07:21:02 -0700

On 2018/08/06 12:00 PM, R Smith wrote:

I need to save text files (let's say between 1 KB and 20 MB) in a SQLite DB.
Why not do both?
If it was me, I would write some code to split the text into sentences (not lines - which is rather easy in English, but might be harder in some other languages).

//...

I've received two off-line questions as to how I could parse text into sentences in "English" even, and thought I would reply here since it might clear up the confusion for others too.

The said questions indicated that the authors probably imagined me possessing some fancy AI comprehending the language into what constitutes notional sentences (Subject+Predicate) or such, but I fear the meaning was much more arbitrary, based on common syntax for written English - as William Faulkner wrote in a letter to Malcolm Cowley:

*"I am trying to say it all in one sentence, between one Cap and one period."*

Think of paragraphs in English as large records delimited by 2 or more line-break characters (#13+#10, or perhaps only #10 if on a *nix platform) between texts.

Each paragraph record could be comprised of one or more sentences (in English) as records delimited by a full-stop+space or full-stop+line-break, or even simply the paragraph end.

By these simple rules, the following can easily be parsed into 1 paragraph with 2 sentences and a second paragraph with 1 sentence (lines here used as formatting only, actual line-breaks indicated with "<-" marker):

<-
The quick brown fox jumps over the
lazy dog.  My grandma said to your
grandma, I'm gonna set your flag
on fire.<-
<-
Next paragraph here...<-
<-
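
As a rough illustration of the two rules above, here is a minimal Python sketch. The function names split_paragraphs and split_sentences are made up for this example, and it assumes the text is already decoded into a string. It knows nothing about abbreviations, ellipses in mid-sentence and so on; it only applies the "2 or more line-breaks" and "full-stop plus whitespace" rules.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on runs of 2 or more line-breaks."""
    # Normalise CR+LF and bare CR line-breaks to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]

def split_sentences(paragraph):
    """Split one paragraph into sentences on full-stop + whitespace."""
    # Replace line-breaks (and repeated spaces) inside the paragraph
    # with single spaces, much like HTML does.
    flat = " ".join(paragraph.split())
    # Split on whitespace that directly follows a full-stop; the
    # full-stop itself stays with the sentence it ends.
    return [s for s in re.split(r"(?<=\.)\s+", flat) if s]

sample = (
    "\n"
    "The quick brown fox jumps over the lazy dog.  My grandma said to your "
    "grandma, I'm gonna set your flag on fire.\n"
    "\n"
    "Next paragraph here...\n"
    "\n"
)

for par_no, paragraph in enumerate(split_paragraphs(sample), start=1):
    for sentence in split_sentences(paragraph):
        print(par_no, sentence)
```

A text with no full-stop followed by whitespace inside a paragraph simply comes back as a single sentence.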

Now a more difficult paragraph would be the following, all of which would translate into 1 single sentence if only the above rules are catered for:

<-
I have three wishes:<-
  - to be outlived by my children<-
  - to fly in space once before I die<-
  - to see Halley's comet once more<-
<-

That will be a single-sentenced paragraph. It's up to the end-implementation to gauge whether that would be a sufficient split or not.

To put this into a DB, I would strip out the line-breaks inside sentences (perhaps not strip out, but replace with space characters, much like HTML does) to make them more easily handled as "lines". The final DB table might then look like this:


ID |  fileID | parNo | parLineNo | docLineNo | txtLine
 1 |     1   |   1   |     1     |     1     | The quick brown fox jumps over the lazy dog.
 2 |     1   |   1   |     2     |     2     | My grandma said to your grandma, I'm gonna set your flag on fire.
 3 |     1   |   2   |     1     |     3     | Next paragraph here...
 4 |     1   |   3   |     1     |     4     | I have three wishes: - to be outlived by my children - to fly in space once before I die - to see Halley's comet once more
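
For what it's worth, a table like the one above could be filled along these lines. This is only a sketch using Python's built-in sqlite3 module: the table name doc_lines, the helper rows_from_text and the column types are my own assumptions, while the column names follow the table above.

```python
import re
import sqlite3

def rows_from_text(file_id, text):
    """Yield (fileID, parNo, parLineNo, docLineNo, txtLine) tuples."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = [p for p in re.split(r"\n{2,}", text) if p.strip()]
    doc_line_no = 0
    for par_no, paragraph in enumerate(paragraphs, start=1):
        # Line-breaks inside a paragraph become single spaces.
        flat = " ".join(paragraph.split())
        sentences = [s for s in re.split(r"(?<=\.)\s+", flat) if s]
        for par_line_no, sentence in enumerate(sentences, start=1):
            doc_line_no += 1
            yield (file_id, par_no, par_line_no, doc_line_no, sentence)

conn = sqlite3.connect(":memory:")  # in-memory DB, for the example only
conn.execute("""
    CREATE TABLE doc_lines (
        ID        INTEGER PRIMARY KEY,
        fileID    INTEGER NOT NULL,
        parNo     INTEGER NOT NULL,
        parLineNo INTEGER NOT NULL,
        docLineNo INTEGER NOT NULL,
        txtLine   TEXT    NOT NULL
    )
""")

sample = (
    "The quick brown fox jumps over the lazy dog.  My grandma said to your "
    "grandma, I'm gonna set your flag on fire.\n"
    "\n"
    "Next paragraph here...\n"
    "\n"
    "I have three wishes:\n"
    "  - to be outlived by my children\n"
    "  - to fly in space once before I die\n"
    "  - to see Halley's comet once more\n"
    "\n"
)

conn.executemany(
    "INSERT INTO doc_lines (fileID, parNo, parLineNo, docLineNo, txtLine) "
    "VALUES (?, ?, ?, ?, ?)",
    rows_from_text(1, sample),
)
conn.commit()

for row in conn.execute("SELECT * FROM doc_lines ORDER BY docLineNo"):
    print(row)
```

Keeping both parLineNo and docLineNo is redundant in a strict sense, but it makes "give me paragraph 2" and "give me lines 100 to 120" equally cheap to query.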

So yes, not a perfect walk-in-the-park, but easy to do for basic text parsing.

Stating the obvious: If the intent is to re-construct the file 100% exact (so it scores the same output for a hashing algorithm) then you cannot strip out line-breaks and you need to carefully include each and every character byte-for-byte used to split paragraphs and the like. It all depends on the implementation requirements.
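
One possible way to keep the split lossless, sketched in Python: split with a capturing group so the delimiters are returned along with the sentences, store both in order, and check that gluing them back together hashes the same as the original. The helper name split_keeping_delimiters and the choice of SHA-256 are assumptions for this example only.

```python
import hashlib
import re

# Delimiters: whitespace following a full-stop, or 2 or more line-breaks.
# The single capturing group around the whole pattern makes re.split()
# return the delimiters too, so nothing is thrown away.
DELIMITER = re.compile(r"((?<=\.)\s+|(?:\r?\n){2,})")

def split_keeping_delimiters(text):
    """Return [text, delimiter, text, delimiter, ..., text]."""
    return DELIMITER.split(text)

def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# When reading from a real file, open it with newline="" (or in binary
# mode) so Python does not translate the line-breaks on the way in.
original = (
    "The quick brown fox jumps over the\r\n"
    "lazy dog.  My grandma said.\r\n"
    "\r\n"
    "Next paragraph here...\r\n"
)

pieces = split_keeping_delimiters(original)
rebuilt = "".join(pieces)

print(digest(rebuilt) == digest(original))
```

Even-numbered positions in the returned list hold sentence text and odd-numbered positions hold the exact delimiter that followed it, so a table could carry the delimiter in an extra column next to the text.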

The above text format should hold for 99.9% of English literature text that can be had in text files (i.e. no images, tables, etc.). Not so easy for scientific papers, research material, movie scripts and a few others.


Sorry for not presenting that great AI solution.  :)
Ryan

_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
