Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-31 Thread Alan Gauld via Tutor
On 31/07/2019 03:02, boB Stepp wrote:

> preceding scores plus the current one.  If the data in the file
> somehow got mangled, it would be an extraordinary coincidence for
> every row to yield a correct total score if that total score was
> recalculated from the corrupted data.

True but the likelihood of that happening is vanishingly small.
What is much more likely is that a couple of bits in the
entire file will be wrong. So a 5 becomes a 7 for example.
Remember that the data in the file is character based
(assuming it's a text file), not numerical. The conversion
to numbers happens when you read it. The conversion is more
likely to detect corrupted data than any calculations you perform.
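
A minimal sketch of that idea, assuming a hypothetical "date,time,score"
row layout (the function name and the sample data are made up for
illustration). Each row is converted while it is read, and any row that
fails the conversion is reported:

```python
import csv
import io


def parse_scores(text):
    """Return (good_rows, bad_line_numbers) for 'date,time,score' CSV text."""
    good, bad = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        try:
            date, time, score = row                 # wrong field count raises ValueError
            good.append((date, time, int(score)))   # non-numeric text raises ValueError
        except ValueError:
            bad.append(line_no)
    return good, bad


sample = "2019-07-30,17:21,52\n2019-07-30,18:20,5x\n2019-07-31,03:02,-7\n"
rows, bad_lines = parse_scores(sample)
print(rows)
print(bad_lines)
```

Note that a 5 turning into a 7 still converts cleanly, so the conversion
only catches damage that produces something that is no longer a number.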

> But the underlying question that I am trying to answer is how
> likely/unlikely is it for a file to get corrupted nowadays?  

It is still quite likely. Not as much as it was 40 years ago,
but still very much a possibility. Especially if the data
is stored/accessed over a network link. It is still very
much a real issue for anyone dealing with critical data.

> worthwhile verifying the integrity of every file in a program, or, at
> least, every data file accessed by a program every program run?  Which
> leads to your point...

Anything critical should go in a database. That will be much
less likely to get corrupted since most RDBMSs include
data cleansing and verification as part of their function.
Also for working with large volumes of data (where corruption
risk rises just because of the volumes) a database is a more
effective way of storing data anyway.

>> Checking data integrity is what checksums are for.
> 
> When should this be done in  normal programming practice?

Any time you have a critical piece of data in a text file.
If it is important to know that the data has changed (for
any reason, not just data corruption) then use a checksum.
Certainly if it's publicly available or you plan on shipping
it over a network a checksum is a good idea.
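
As a rough sketch of what that can look like in Python, the standard
hashlib module will compute a digest of a file's bytes. SHA-256, the
helper name and the temporary file below are my own choices for the
example, not anything prescribed in this thread:

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "scores.csv")
    with open(data_path, "w", newline="") as f:
        f.write("2019-07-30,17:21,52\n")
    saved = file_checksum(data_path)    # normally stored alongside the data

    unchanged = file_checksum(data_path) == saved

    with open(data_path, "w", newline="") as f:
        f.write("2019-07-30,17:21,72\n")   # a 5 has become a 7
    changed = file_checksum(data_path) != saved

print(unchanged, changed)
```

The saved digest has to live somewhere the same corruption is unlikely
to reach, for example in a separate file or shipped with the download.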

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-31 Thread Chris Roy-Smith

On 31/7/19 2:21 am, boB Stepp wrote:

> I have been using various iterations of a solitaire scorekeeper
> program to explore different programming thoughts.  In my latest
> musings I am wondering about -- in general -- whether it is best to
> store calculated data values in a file and reload these values, or
> whether to recalculate such data upon each new run of a program.  In
> terms of my solitaire scorekeeper program is it better to store "Hand
> Number, Date, Time, Score, Total Score" or instead, "Hand Number,
> Date, Time, Score"?  Of course I don't really need to store hand
> number since it is easily determined by its row/record number in its
> csv file.
>
> In this trivial example I cannot imagine there is any realistic
> difference between the two approaches, but I am trying to generalize
> my thoughts for potentially much more expensive calculations, very
> large data sets, and what is the likelihood of storage errors
> occurring in files.  Any thoughts on this?
>
> TIA!

From a scientific viewpoint, you want to keep the raw data, so you can 
perform other calculations that you may not have thought of yet. But 
that's not got much to do with programming ;)



Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread boB Stepp
On Tue, Jul 30, 2019 at 7:26 PM Mats Wichmann wrote:
>
> On 7/30/19 5:58 PM, Alan Gauld via Tutor wrote:
> > On 30/07/2019 17:21, boB Stepp wrote:
> >
> >> musings I am wondering about -- in general -- whether it is best to
> >> store calculated data values in a file and reload these values, or
> >> whether to recalculate such data upon each new run of a program.
> >
> > It depends on the use case.
> >
> > For example a long running server process may not care about startup
> > delays because it only starts once (or at least very rarely) so either
> > approach would do but saving diskspace may be helpful so calculate the
> > values.
> >
> > On the other hand a data batch processor running once as part of a
> > chain working with high data volumes probably needs to start quickly.
> > In which case do the calculations take longer than reading the
> > extra data? Probably, so store in a file.
> >
> > There are other options too such as calculating the value every
> > time it is used - only useful if the data might change
> > dynamically during the program execution.
> >
> > It all depends on how much data?, how often it is used?,
> > how often would it be calculated? How long does the process
> > run for? etc.
>
>
> Hey, boB - I bet you *knew* the answer was going to be "it depends" :)

You are coming to know me all too well! ~(:>))

I just wanted to check with the professionals here whether my thinking
(concealed behind the questions I asked) was correct or, if not, where
I am off.

> There are two very common classes of application that have to make this
> very decision - real databases, and their toy cousins, spreadsheets.
>
> In the relational database world - characterized by very long-running
> processes (like: unless it crashes, runs until reboot. and maybe even
> beyond that - if you have a multi-mode replicated or distributed DB it
> may survive failure of one point) - if a field is calculated it's not
> stored. Because - what Alan said: in an RDBMS, data are _expected_ to
> change during runtime. And then for performance reasons, there may be
> some cases where it's precomputed and stored to avoid huge delays when
> the computation is expensive. That world even has a term for that: a
> materialized view (in contrast to a regular view).  It can get pretty
> tricky, you need something that causes the materialized view to update
> when data has changed; for databases that don't natively support the
> behavior you then have to fiddle with triggers and hopefully it works
> out.  More enlightened now?

Not more enlightened, perhaps, but more convinced than ever on how
difficult it is to manage the complexity of real world programs.
-- 
boB


Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread boB Stepp
On Tue, Jul 30, 2019 at 7:05 PM Alan Gauld via Tutor wrote:
>
> On 30/07/2019 18:20, boB Stepp wrote:
>
> > What is the likelihood of file storage corruption?  I have a vague
> > sense that in earlier days of computing this was more likely to
> > happen, but nowadays?  Storing and recalculating does act as a good
> > data integrity check of the file data.
>
> No it doesn't! You are quite likely to get a successful calculation
> using nonsense data and therefore invalid results. But they look
> valid - a number is a number...

Though I may be dense here, for the particular example I started with
the total score in a solitaire game is equal to the sum of all of the
preceding scores plus the current one.  If the data in the file
somehow got mangled, it would be an extraordinary coincidence for
every row to yield a correct total score if that total score was
recalculated from the corrupted data.
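
To make that concrete, here is a small sketch of the cross-check,
assuming each row has already been read into a (score, stored_total)
pair. The function name and the sample rows are invented for the
example:

```python
from itertools import accumulate


def find_total_mismatches(rows):
    """Return 1-based row numbers whose stored total differs from the
    running total recomputed from the scores.

    rows is a list of (score, stored_total) pairs.
    """
    scores = [score for score, _ in rows]
    mismatches = []
    for n, (running, (_, stored)) in enumerate(
            zip(accumulate(scores), rows), start=1):
        if running != stored:
            mismatches.append(n)
    return mismatches


good = [(10, 10), (-3, 7), (5, 12)]
corrupt = [(10, 10), (-3, 7), (7, 12)]   # the last score should be 5
print(find_total_mismatches(good))
print(find_total_mismatches(corrupt))
```

A mismatch only says that a score and its total disagree. It cannot say
which of the two is wrong, and a file without stored totals has nothing
to compare against.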

But the underlying question that I am trying to answer is how
likely/unlikely is it for a file to get corrupted nowadays?  Is it
worthwhile verifying the integrity of every file in a program, or, at
least, every data file accessed by a program every program run?  Which
leads to your point...

> Checking data integrity is what checksums are for.

When should this be done in  normal programming practice?

-- 
boB


Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread Mats Wichmann
On 7/30/19 5:58 PM, Alan Gauld via Tutor wrote:
> On 30/07/2019 17:21, boB Stepp wrote:
> 
>> musings I am wondering about -- in general -- whether it is best to
>> store calculated data values in a file and reload these values, or
>> whether to recalculate such data upon each new run of a program.  
> 
> It depends on the use case.
> 
> For example a long running server process may not care about startup
> delays because it only starts once (or at least very rarely) so either
> approach would do but saving diskspace may be helpful so calculate the
> values.
> 
> On the other hand a data batch processor running once as part of a
> chain working with high data volumes probably needs to start quickly.
> In which case do the calculations take longer than reading the
> extra data? Probably, so store in a file.
> 
> There are other options too such as calculating the value every
> time it is used - only useful if the data might change
> dynamically during the program execution.
> 
> It all depends on how much data?, how often it is used?,
> how often would it be calculated? How long does the process
> run for? etc.


Hey, boB - I bet you *knew* the answer was going to be "it depends" :)

There are two very common classes of application that have to make this
very decision - real databases, and their toy cousins, spreadsheets.

In the relational database world - characterized by very long-running
processes (like: unless it crashes, runs until reboot, and maybe even
beyond that - if you have a multi-node replicated or distributed DB it
may survive failure of one point) - if a field is calculated it's not
stored. Because - what Alan said: in an RDBMS, data are _expected_ to
change during runtime. And then for performance reasons, there may be
some cases where it's precomputed and stored to avoid huge delays when
the computation is expensive. That world even has a term for that: a
materialized view (in contrast to a regular view).  It can get pretty
tricky, you need something that causes the materialized view to update
when data has changed; for databases that don't natively support the
behavior you then have to fiddle with triggers and hopefully it works
out.  More enlightened now?
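
For the curious, a rough sketch of the difference using Python's
sqlite3 module. SQLite has no materialized views, so this is the
"fiddle with triggers" case; the table, view and trigger names are made
up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hands (id INTEGER PRIMARY KEY, score INTEGER NOT NULL);

    -- Regular view: the total is recalculated on every query.
    CREATE VIEW total_view AS
        SELECT COALESCE(SUM(score), 0) AS total FROM hands;

    -- Stand-in for a materialized view: a stored total kept
    -- current by a trigger.
    CREATE TABLE total_stored (total INTEGER NOT NULL);
    INSERT INTO total_stored VALUES (0);

    CREATE TRIGGER hands_after_insert AFTER INSERT ON hands
    BEGIN
        UPDATE total_stored SET total = total + NEW.score;
    END;
""")

conn.executemany("INSERT INTO hands (score) VALUES (?)",
                 [(10,), (-3,), (5,)])
conn.commit()

computed = conn.execute("SELECT total FROM total_view").fetchone()[0]
stored = conn.execute("SELECT total FROM total_stored").fetchone()[0]
print(computed, stored)
conn.close()
```

Only INSERT is covered here. UPDATE and DELETE on hands would each need
their own trigger, which is where the "hopefully it works out" part
comes in.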



Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread Alan Gauld via Tutor
On 30/07/2019 18:20, boB Stepp wrote:

> What is the likelihood of file storage corruption?  I have a vague
> sense that in earlier days of computing this was more likely to
> happen, but nowadays?  Storing and recalculating does act as a good
> data integrity check of the file data.

No it doesn't! You are quite likely to get a successful calculation
using nonsense data and therefore invalid results. But they look
valid - a number is a number...

Checking data integrity is what checksums are for.




Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread Alan Gauld via Tutor
On 30/07/2019 17:21, boB Stepp wrote:

> musings I am wondering about -- in general -- whether it is best to
> store calculated data values in a file and reload these values, or
> whether to recalculate such data upon each new run of a program.  

It depends on the use case.

For example a long-running server process may not care about startup
delays because it only starts once (or at least very rarely) so either
approach would do but saving disk space may be helpful so calculate the
values.

On the other hand a data batch processor running once as part of a
chain working with high data volumes probably needs to start quickly.
In which case do the calculations take longer than reading the
extra data? Probably, so store in a file.

There are other options too such as calculating the value every
time it is used - only useful if the data might change
dynamically during the program execution.

It all depends: how much data is there? How often is it used?
How often would it be calculated? How long does the process
run for? etc.
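
As an illustration of the "calculate every time it is used" option,
here is a minimal sketch using a property. The ScoreSheet class is
hypothetical, not taken from boB's program:

```python
class ScoreSheet:
    """Holds raw hand scores; the total is never stored."""

    def __init__(self):
        self.scores = []

    def add_hand(self, score):
        self.scores.append(score)

    @property
    def total(self):
        # Recalculated on every access, so it always reflects
        # the current list of scores.
        return sum(self.scores)


sheet = ScoreSheet()
sheet.add_hand(10)
first_total = sheet.total
sheet.add_hand(-3)
second_total = sheet.total
print(first_total, second_total)
```

This stays correct while the data changes during the run, at the price
of repeating the sum on each access.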



Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread boB Stepp
On Tue, Jul 30, 2019 at 12:05 PM Zachary Ware wrote:
>
> On Tue, Jul 30, 2019 at 11:24 AM boB Stepp wrote:
> > In this trivial example I cannot imagine there is any realistic
> > difference between the two approaches, but I am trying to generalize
> > my thoughts for potentially much more expensive calculations, very
> > large data sets, and what is the likelihood of storage errors
> > occurring in files.  Any thoughts on this?
>
> As with many things in programming, it comes down to how much time you
> want to trade for space.  If you have a lot of space and not much
> time, store the calculated values.  If you have a lot of time (or the
> calculation time is negligible) and not much space, recalculate every
> time.  If you have plenty of both, store it and recalculate it anyway

What is the likelihood of file storage corruption?  I have a vague
sense that in earlier days of computing this was more likely to
happen, but nowadays?  Storing and recalculating does act as a good
data integrity check of the file data.

-- 
boB


Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?

2019-07-30 Thread Zachary Ware
On Tue, Jul 30, 2019 at 11:24 AM boB Stepp wrote:
> In this trivial example I cannot imagine there is any realistic
> difference between the two approaches, but I am trying to generalize
> my thoughts for potentially much more expensive calculations, very
> large data sets, and what is the likelihood of storage errors
> occurring in files.  Any thoughts on this?

As with many things in programming, it comes down to how much time you
want to trade for space.  If you have a lot of space and not much
time, store the calculated values.  If you have a lot of time (or the
calculation time is negligible) and not much space, recalculate every
time.  If you have plenty of both, store it and recalculate it anyway
:).  Storing the information can also be useful for offline debugging.
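
A rough sketch of the "store the calculated value" side of that trade,
with a fallback to recalculating. The JSON cache file, the function
names and the input fingerprint are all assumptions made for the
example:

```python
import hashlib
import json
import os
import tempfile


def expensive_total(scores):
    """Hypothetical stand-in for a calculation that is costly to repeat."""
    return sum(scores)


def load_or_compute(cache_path, scores):
    """Return (total, source); source is 'cache' or 'computed'."""
    # Fingerprint the input so a stale cache is not reused
    # after the data changes.
    key = hashlib.sha256(json.dumps(scores).encode("utf-8")).hexdigest()
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cached = json.load(f)
        if cached.get("key") == key:
            return cached["total"], "cache"
    total = expensive_total(scores)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"key": key, "total": total}, f)
    return total, "computed"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "total.json")
    first = load_or_compute(path, [10, -3, 5])
    second = load_or_compute(path, [10, -3, 5])
    third = load_or_compute(path, [10, -3, 5, 8])

print(first, second, third)
```

Fingerprinting still reads all of the input, so this only pays off when
the calculation costs noticeably more than reading and hashing the data.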

-- 
Zach