Why are you storing first_seen in every record?  To avoid searching for it
when reports are generated?

On Sat, Aug 31, 2019 at 6:24 AM Rob Willett <rob.sql...@robertwillett.com>
wrote:

> Hi,
>
> We have a very similar system that captures traffic incident information
> such as accidents, roadworks, traffic jams and sends personalised
> information to drivers based on their routes.
>
> We realised after a few years that our original design for the database
> was inefficient: we had a single table of about 60GB, circa 500M rows,
> that consisted of traffic incidents and updates. The inefficiency was in
> our design and not in SQLite.
>
> This inefficiency showed when we ran a daily report that summarised the
> traffic information for historical trends, e.g. we wanted to know the
> average time for a road accident to be cleared on a stretch of road on a
> specific day or at a specific time of day. This report meant we read in
> circa 500M rows and 60GB of data each night, and it took circa 90 mins
> to run as we did a lot of calculations.
>
> We were not able to change the structure of the database too much (as it
> was too far down the line). So we spent a long time looking at the data
> with cold, wet towels on our foreheads in darkened rooms :) After we
> analysed the data we realised that most of our 'updates' were not
> actually updates at all; all we were actually interested in was the
> start and end times of incidents, and all the 'stuff' in the middle was
> just noise. The problem we had was that disruptions could last a long
> time and there was no actual end-of-incident marker, just the absence of
> another disruption with that unique identifier and a later timestamp.
> This sounds similar to your situation. Basically we constructed a simple
> state model based on time.
>
> What we now do is run a Perl script each night to remove anything that
> sits between the first and last records for an incident. We know that
> the last record may move on the next day, but at that point in time it
> is still the last/end/most recent indicator. Our data is structured
> around time, so we always store a time-of-insertion value in each
> record, as things may not be ordered by row id.
>
> Once we had this insight, we took our database down from 60GB to 800MB,
> and the report went from 90 mins to 90 secs. We have to give credit to
> this mailing list for the patient and courteous way it helped and
> assisted with our often stupid questions.
>
> The lessons we learnt from this are:
>
> 1. Get your database schema right from the start. We didn't. In
> hindsight we should have spent longer looking at this.
>
> 2. Put in more information than you think you may need, just in case. We
> did that: we put time information in at per-second granularity.
>
> 3. Don't worry too much about disk space unless you are on a constrained
> device. In hindsight we could have stored even more data :)
>
> 4. Think hard about solutions that are simple. We love simplicity as
> this makes it easier to go back and change things. Too many dependencies
> would be too difficult to unpick. Keeping it simple also works for us as
> we're not experts :)
>
> 5. SQLite seems to be able to do anything we want it to. We know the
> 'limitations' of it not being a full client/server database; just to be
> clear, we do not consider this a limitation at all but rather a virtue.
> Other people seem worried about the 'lack' of some datatypes, but we do
> masses of data and date conversions as needed (see the small example
> after this list) and it has never been a speed issue or any other issue.
>
> 6. Ask this group for help. We realised that the more information we
> provided in a clear and concise manner in our help email, the more help
> we got; it was a virtuous circle. The combined knowledge of the people
> in this group far, far exceeds our own. As a rule of thumb, if we ask
> for help, we expect it to take 1-2 hours as a minimum for us to write
> the email. That may be because we're not SQL experts. We also follow up
> every reply: if people have taken the time to reply to us, they deserve
> a reply themselves. As we are English, we could end up in an endless
> thanks-for-the-reply loop :) The group has an exceptionally high signal
> to noise ratio and is invariably courteous.
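>
> For example (made-up values), converting between an integer Unix
> timestamp and readable text is a single function call each way:
>
>     -- integer seconds-since-epoch to text, and back again
>     SELECT datetime(1567241040, 'unixepoch');      -- 2019-08-31 08:44:00
>     SELECT strftime('%s', '2019-08-31 08:44:00');  -- 1567241040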
>
> I would be astonished if you can't get SQLite to do what you want to do.
> We have never managed to touch the sides of the system and suspect we
> never will.
>
> More than happy to answer more questions as this group helped us and
> it's only fair we offer the help back. I will state that we are not SQL
> (or SQLite) experts :)
>
> Rob
>
> On 30 Aug 2019, at 21:44, Random Coder wrote:
>
> > First off, if this sort of "code review" style question is
> > inappropriate for this list, please feel free to reply to me directly
> > and tell me to stop, and I'll know to not do this again.
> >
> > That said, I have a question on the use of SQLite.  At the risk of
> > falling into the XY problem, I'll give some details on what I'm doing,
> > and then how I'm doing it.  My basic question is whether what I'm
> > doing is valid, and whether I'm doing anything needlessly wasteful.
> >
> > I have a system monitoring events (and I know I'm being vague on the
> > exact nature of these events; I can't provide too many details, sorry.
> > Please try to accept what I say as given about them).  The events have
> > IDs that are generated externally; they're 30-40 ASCII characters
> > long, appear random, and are known to be unique by external means for
> > a given event.  For the purposes of this particular problem, the only
> > thing I care about tracking is when I first saw an event, and the last
> > time I saw it.  For better or worse, this ecosystem already thinks
> > about timestamps as the number of minutes since a specific epoch, and
> > is used to treating all time values as an integer in that space, so
> > I'm doing the same here.
> >
> > So, I have a RESTful server written in Python, using APSW to create a
> > simple SQLite database:
> >
> >     CREATE TABLE IF NOT EXISTS
> >         event(
> >             event_id TEXT PRIMARY KEY,
> >             first_seen INTEGER,
> >             last_seen INTEGER
> >         ) WITHOUT ROWID;
> >
> > Every time a new event comes in (they might be very out of order), I
> > do a:
> >
> >     INSERT INTO event(event_id, first_seen, last_seen) VALUES(?, ?, ?)
> >     ON CONFLICT(event_id) DO UPDATE SET
> >         first_seen = MIN(excluded.first_seen, event.first_seen),
> >         last_seen = MAX(excluded.last_seen, event.last_seen);
> >
> > This creates the record for the event if it's new, or possibly updates
> > an existing one with new values.  To give a sense of scale, I have
> > around 5 billion events stored right now for the past couple of years
> > in a 250GB database, and I see around 20 million events per day; some
> > small percentage of those are new.
> >
> > The important thing I can do for users is pull up reports.  The report
> > is roughly a summary of how old events are (when they were first seen,
> > and how long they've been seen for).  Outliers are highlighted, as are
> > events that haven't been seen at all.  The user will provide around
> > ten thousand event IDs; the majority of them, approaching 99%, will
> > exist in my database.  When the user requests a report, I create an
> > in-memory database:
> >
> >    ATTACH ':memory:' AS mem_db;
> >    CREATE TABLE mem_db.valid(event_id TEXT PRIMARY KEY);
> >
> > And populate that table with the events the user is interested in.
> > I'm doing this since I won't get the full list of IDs at once; it's
> > built up over some minutes.  If the system dies in the middle of a
> > request, it's OK to start over.  Then I run:
> >
> >     SELECT
> >         mem_db.valid.event_id,
> >         event.first_seen,
> >         event.last_seen
> >     FROM
> >         mem_db.valid
> >     LEFT JOIN event ON
> >         event.event_id = mem_db.valid.event_id;
> >
> > And gather up the results and pretty them up for the user.
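> >
> > For completeness, populating mem_db.valid is just one small insert per
> > incoming ID.  The OR IGNORE below is merely one way to make a repeated
> > ID harmless; that part is my own choice rather than a requirement:
> >
> >     -- run once per event ID as the request is built up
> >     INSERT OR IGNORE INTO mem_db.valid(event_id) VALUES (?);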
> >
> > Does all of this seem valid?  It works, so I'm OK with it, but I'm far
> > from a SQLite expert, and I want to know if I'm going to be backing
> > myself into a corner or otherwise torturing things that should be done
> > differently.  Or, if the answer is "Don't use SQLite for that",
> > that's fine too; I'll start looking at other options.
> >
> > Thanks for any feedback.
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
