Why are you storing first_seen in every record? To avoid searching for it when reports are generated?
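To make the question concrete: the alternative I'm thinking of is a raw sightings log with first/last derived at report time, roughly like the sketch below. The table and column names here are purely hypothetical, not your schema.

    # Hypothetical alternative: keep a raw log of sightings and compute
    # first_seen/last_seen at report time instead of storing them per event.
    # "sighting" and "seen_at" are invented names, just for illustration.
    import apsw

    con = apsw.Connection("events.db")
    cur = con.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sighting(
            event_id TEXT NOT NULL,
            seen_at  INTEGER NOT NULL  -- minutes since your epoch
        )
    """)

    # The report-time search that storing first_seen/last_seen avoids:
    for event_id, first_seen, last_seen in cur.execute("""
        SELECT event_id, MIN(seen_at), MAX(seen_at)
        FROM sighting
        GROUP BY event_id
    """):
        print(event_id, first_seen, last_seen)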
On Sat, Aug 31, 2019 at 6:24 AM Rob Willett <rob.sql...@robertwillett.com> wrote:

> Hi,
>
> We have a very similar system that captures traffic incident information
> such as accidents, roadworks and traffic jams, and sends personalised
> information to drivers based on their routes.
>
> We realised after a few years that our original design for the database
> was inefficient, as we had a single table that was about 60GB, circa 500M
> rows, that consisted of traffic incidents and updates. The inefficiency
> was our design and not SQLite.
>
> This inefficiency showed when we used to run a daily report that
> summarised the traffic information for historical trends, e.g. we wanted
> to know the average time for a road accident to be cleared on a stretch
> of road on a specific day or time of day. This report meant we read in
> circa 500M rows and 60GB of data each night, and it took circa 90 mins
> to run as we did a lot of calculations.
>
> We were not able to change the structure of the database too much (as it
> was too far down the line). So we spent a long time looking at the data
> with cold, wet towels on our foreheads in darkened rooms :) After we
> analysed the data we realised that most of our 'updates' were not
> actually updates at all, and all we were actually interested in was the
> start and end times of incidents; all the 'stuff' in the middle was just
> noise. The problem we had was that disruptions could last a long time
> and there was no actual end-of-incident marker, just that there wasn't
> another disruption with that unique identifier and a later timestamp.
> This sounds similar to your situation. Basically we constructed a simple
> state model based on time.
>
> What we now do is run a Perl script each night to remove anything that
> is between the first and end incidents. We know that the end incident
> may move on the next day, but at that point in time it is still the
> last/end/most recent indicator. Our data is structured around time so we
> always have a time-of-insertion indicator in the record, as things may
> not be ordered by row id.
>
> Once we had this insight, we took our database down from 60GB to 800MB,
> and the report went from 90 mins to 90 secs. We have to give credit to
> this mailing list for the patient and courteous way it helped and
> assisted with our often stupid questions.
>
> The lessons we learnt from this are:
>
> 1. Get your database schema right from the get-go. We didn't. In
> hindsight we should have spent longer looking at this.
>
> 2. Put in more information than you think you may need, just in case.
> We did that. We put time information in at per-second granularity.
>
> 3. Don't worry too much about disk space unless you are on a constrained
> device. In hindsight we could have stored even more data :)
>
> 4. Think hard about solutions that are simple. We love simplicity as
> this makes it easier to go back and change things. Too many dependencies
> would be too difficult to unpick. Keeping it simple also works for us as
> we're not experts :)
>
> 5. SQLite seems to be able to do anything we want it to. We know the
> 'limitations' of it not being a full client/server database; just to be
> clear, we do not consider this a limitation at all but rather a virtue.
> Other people seem worried about the 'lack' of some datatypes, but we do
> masses of data and date conversions as needed and it's never been a
> speed issue or any issue.
>
> 6. Ask this group for help.
> We realised that the more information we provided in a clear and concise
> manner in our help email, the more help we got; it was a virtuous circle.
> The sum help of the people in this group far, far exceeds our own
> knowledge. As a rule of thumb, if we ask for help, we expect it to take
> 1-2 hours as a minimum for us to write the email. That may be because
> we're not SQL experts. Also follow up every email: if people have taken
> the time to reply to us, they deserve a reply themselves. As we are
> English, we could end up in a thanks-for-the-reply endless loop :) The
> group has an exceptionally good signal-to-noise ratio and is invariably
> courteous.
>
> I would be astonished if you can't get SQLite to do what you want to do.
> We have never managed to touch the sides of the system and suspect we
> never will.
>
> More than happy to answer more questions, as this group helped us and
> it's only fair we offer the help back. I will state that we are not SQL
> (or SQLite) experts :)
>
> Rob
>
> On 30 Aug 2019, at 21:44, Random Coder wrote:
>
> > First off, if this sort of "code review" style question is
> > inappropriate for this list, please feel free to reply to me directly
> > and tell me to stop, and I'll know not to do this again.
> >
> > That said, I have a question on the use of SQLite. At the risk of
> > falling into the XY problem, I'll give some details on what I'm doing,
> > and then how I'm doing it. My basic question is whether what I'm doing
> > is valid, and if I'm doing anything needlessly wasteful.
> >
> > I have a system monitoring events (and I know I'm being vague on the
> > exact nature of these events, I can't provide too many details, sorry.
> > Please try to accept what I say as given about them). The events have
> > IDs that are generated externally; they're 30-40 ASCII characters
> > long, appear random, and are known to be unique by external means for
> > a given event. For the purposes of this particular problem, the only
> > thing I care about tracking is when I first saw an event, and the last
> > time I saw it. For better or worse, this ecosystem already thinks
> > about timestamps as the number of minutes since a specific epoch, and
> > is used to treating all time values as an integer in that space, so
> > I'm doing the same here.
> >
> > So, I have a RESTful server written in Python, using APSW to create a
> > simple SQLite database:
> >
> >     CREATE TABLE IF NOT EXISTS
> >     event(
> >         event_id TEXT PRIMARY KEY,
> >         first_seen INTEGER,
> >         last_seen INTEGER
> >     ) WITHOUT ROWID;
> >
> > Every time a new event comes in (they might be very out of order), I
> > do a:
> >
> >     INSERT INTO event(event_id, first_seen, last_seen) VALUES(?, ?, ?)
> >     ON CONFLICT(event_id) DO UPDATE SET
> >         first_seen = MIN(excluded.first_seen, event.first_seen),
> >         last_seen = MAX(excluded.last_seen, event.last_seen);
> >
> > to create the record for the event if it's new, or possibly update an
> > existing one with new values. To give a sense of scale, I have around
> > 5 billion events stored right now for the past couple of years in a
> > 250GB database, and I see around 20 million per day; some small
> > percentage of those are new.
> >
> > The important thing I can do for users is pull up reports. The report
> > is roughly a summary of how old events are (when they were first seen,
> > and how long they've been seen for). Outliers are highlighted, as are
> > events that haven't been seen at all.
> > The user will provide around ten thousand event IDs; the majority of
> > them, approaching 99%, will exist in my database. When the user
> > requests a report, I create an in-memory database:
> >
> >     ATTACH ':memory:' AS mem_db;
> >     CREATE TABLE mem_db.valid(event_id TEXT PRIMARY KEY);
> >
> > and populate that table with the events the user is interested in.
> > I'm doing this since I won't get the list of items in one go; it's
> > built up over some minutes. If the system dies in the middle of a
> > request, it's OK to start over. Then I run:
> >
> >     SELECT
> >         mem_db.valid.event_id,
> >         event.first_seen,
> >         event.last_seen
> >     FROM
> >         mem_db.valid
> >     LEFT JOIN event ON
> >         event.event_id = mem_db.valid.event_id;
> >
> > and gather up the results and pretty them up for the user.
> >
> > Does all of this seem valid? It works, so I'm OK with it, but I'm far
> > from a SQLite expert, and I want to know if I'm going to be backing
> > myself into a corner or otherwise torturing things that should be done
> > differently. Or, if the answer is "Don't use SQLite for that", that's
> > fine too; I'll start looking at other options.
> >
> > Thanks for any feedback.
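For anyone following along, the nightly compaction Rob describes could look something like the sketch below. The table name "disruption" and the columns "incident_id"/"observed_at" are my guesses for illustration, not his actual schema; the idea is just to delete every row for an incident except its earliest and latest ones.

    # Sketch of the nightly compaction step: for each incident, keep only
    # the earliest and latest rows and delete the 'noise' in between.
    # All schema names here are invented for illustration.
    import apsw

    con = apsw.Connection("traffic.db")
    cur = con.cursor()

    cur.execute("BEGIN")
    cur.execute("""
        DELETE FROM disruption
        WHERE observed_at > (SELECT MIN(observed_at) FROM disruption AS d
                             WHERE d.incident_id = disruption.incident_id)
          AND observed_at < (SELECT MAX(observed_at) FROM disruption AS d
                             WHERE d.incident_id = disruption.incident_id)
    """)
    cur.execute("COMMIT")

As Rob notes, the "end" row can still move forward on a later day; it simply becomes the new MAX and the next night's run cleans up behind it.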
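And for the original poster: the approach reads fine to me. For anyone who wants to see the whole flow in one place, here is a rough end-to-end sketch in Python/APSW of what is described above; the function names and the explicit batching are my own framing, not the poster's code.

    # Rough end-to-end sketch of the flow described above (Python + APSW).
    # record_events()/build_report() and the batching are illustrative only.
    import apsw

    con = apsw.Connection("events.db")
    cur = con.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS event(
            event_id   TEXT PRIMARY KEY,
            first_seen INTEGER,
            last_seen  INTEGER
        ) WITHOUT ROWID
    """)

    def record_events(events):
        # events: iterable of (event_id, seen_at) pairs, possibly out of order
        cur.execute("BEGIN")
        cur.executemany("""
            INSERT INTO event(event_id, first_seen, last_seen) VALUES(?, ?, ?)
            ON CONFLICT(event_id) DO UPDATE SET
                first_seen = MIN(excluded.first_seen, event.first_seen),
                last_seen  = MAX(excluded.last_seen,  event.last_seen)
        """, [(eid, seen, seen) for eid, seen in events])
        cur.execute("COMMIT")

    def build_report(wanted_ids):
        # wanted_ids: the ~10k event IDs a user asks about
        cur.execute("ATTACH ':memory:' AS mem_db")
        cur.execute("CREATE TABLE mem_db.valid(event_id TEXT PRIMARY KEY)")
        cur.executemany("INSERT INTO mem_db.valid VALUES(?)",
                        [(eid,) for eid in wanted_ids])
        rows = list(cur.execute("""
            SELECT v.event_id, e.first_seen, e.last_seen
            FROM mem_db.valid AS v
            LEFT JOIN event AS e ON e.event_id = v.event_id
        """))
        cur.execute("DETACH mem_db")
        return rows

Wrapping each batch of inserts in an explicit transaction matters mostly for write throughput: 20 million single-row autocommit inserts a day would spend most of their time syncing to disk.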