Re: NAND out of space crash

2008-07-21 Thread John Watlington

I think this is a huge problem.   Here in Uruguay they are seeing
a flood of machines with this problem, and it will only get worse
over time (and we will encounter this in every other deployment
soon.)

They desperately need a fix...

wad

On Jul 21, 2008, at 12:55 PM, Greg Smith wrote:

> Hi All,
>
> I found http://dev.laptop.org/ticket/7125 which looks like a good place
> to track this problem.
>
> I marked it blocker for 8.2.0.
>
> Here's what I think we need:
> - Sugar GUI always starts, no matter how much space is free on the NAND.
> - If Sugar starts and you are low on space (exact size tbd) then we
>   should alert the user to start clearing space in the journal.
>
> I think Eben will work on the second part. Can someone solve the first
> part?
>
> Suggested steps would be to propose a solution, get buy in, code it and
> check it in.
>
> I shouldn't have mentioned partitioning :-( All I meant was that we
> cannot solve this on upgrade by whacking all user data.
>
> Thanks,
>
> Greg S


___
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel


Re: NAND out of space crash

2008-07-21 Thread Jim Gettys
There are two issues here that we should be sure to not intertwingle:

1) whatever behavior Sugar may have when low/out of space, during
operation, or at boot time.

2) JFFS2's behavior when the file system is almost full.  When it gets
almost full, it can spend all its time trying to garbage collect, and
you can lose completely (the system sort of gets the slows, and grinds
to a halt).

As to 2), there are patches done by Nokia (deployed on the N800 and
similar devices) that reserve some extra space and report out of space
before the system gets the slows.  These are in Dave's incoming queue
to merge into JFFS2 the last I heard.  I don't know if he's merged them.
- Jim




-- 
Jim Gettys [EMAIL PROTECTED]
One Laptop Per Child



Re: NAND out of space crash

2008-07-21 Thread C. Scott Ananian
On Mon, Jul 21, 2008 at 12:52 PM, Jim Gettys [EMAIL PROTECTED] wrote:
> There are two issues here that we should be sure to not intertwingle:
>
> 1) whatever behavior Sugar may have when low/out of space, during
> operation, or at boot time.

A number of independent issues here:
 a) the initscripts should be sure to unfreeze the dcon if/when X
fails to start.  This ensures that the system is obviously recoverable
(you can recover by rebooting with the check key held down, but this
is not obvious!).
 b) sugar should, ideally, start even if flash is full.   It is
currently failing when writing to ~olpc/.boot_time or some such, and
crashing.
 c) once sugar starts, there should be a message indicating that the
NAND is critically full.
 d) trying to save new content to the journal should also give an
obvious message that the NAND is full.
 e) removing content from the journal should work even if NAND is full.

I think (a), (b), and (e) are critical for 8.2.  (c) is being handled
independently by Uruguay, and (c) and (d) should be targets for 9.1.
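
Items (c) and (d) both depend on knowing when the NAND is close to full. A minimal shell sketch of such a check follows. `free_kb` and `is_low_on_space` are hypothetical helper names, and the 5120 KB threshold is a placeholder, since the thread leaves the exact size undecided. It relies only on POSIX `df -P -k`, so it inherits whatever accounting the filesystem reports.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Sugar or olpc-utils.

# free_kb PATH: print the free space, in KB, of the filesystem holding PATH.
# POSIX "df -P -k" prints one header line, then one line per filesystem
# whose fourth field is the available space in 1024-byte units.
free_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# is_low_on_space FREE_KB THRESHOLD_KB: succeed when FREE_KB is below the threshold.
is_low_on_space() {
    [ "$1" -lt "$2" ]
}

# Example: warn when the filesystem holding $HOME has under 5120 KB free.
# 5120 is a placeholder value, not a number taken from the thread.
if is_low_on_space "$(free_kb "${HOME:-/}")" 5120; then
    echo "NAND is critically full: free some space in the journal" >&2
fi
```

Keeping the measurement (`free_kb`) apart from the decision (`is_low_on_space`) lets the threshold be tuned, or the measurement replaced, without touching the other half.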

> 2) JFFS2's behavior when the file system is almost full.  When it gets
> almost full, it can spend all its time trying to garbage collect, and
> you can lose completely (the system sort of gets the slows, and grinds
> to a halt).
>
> As to 2), there are patches done by Nokia (deployed on the N800 and
> similar devices) that reserve some extra space and report out of space
> before the system gets the slows.  These are in Dave's incoming queue
> to merge into JFFS2 the last I heard.  I don't know if he's merged them.

These are less critical, IMO.  I have filled up NAND, and the slows
are not debilitating.  The issues above are. We should encourage Dave
to fix this issue and the other known JFFS2 bugs (trac #6480, for
instance)  -- or get dsaxena to do so -- for 9.1.
 --scott

-- 
 ( http://cscott.net/ )


Re: NAND out of space crash

2008-07-21 Thread Deepak Saxena
On Jul 21 2008, at 13:39, C. Scott Ananian was caught saying:
> > 2) JFFS2's behavior when the file system is almost full.  When it gets
> > almost full, it can spend all its time trying to garbage collect, and
> > you can lose completely (the system sort of gets the slows, and grinds
> > to a halt).
> >
> > As to 2), there are patches done by Nokia (deployed on the N800 and
> > similar devices) that reserve some extra space and report out of space
> > before the system gets the slows.  These are in Dave's incoming queue
> > to merge into JFFS2 the last I heard.  I don't know if he's merged them.
>
> These are less critical, IMO.  I have filled up NAND, and the slows
> are not debilitating.  The issues above are. We should encourage Dave
> to fix this issue and the other known JFFS2 bugs (trac #6480, for
> instance)  -- or get dsaxena to do so -- for 9.1.

#6480 is fixed as of yesterday, should be in next joyride.

I'll be re-doing Nokia's patches so that they go upstream if we still want
them after 8.2 is out; however, I don't think the approach used by them
actually helps us.  We already have a very limited amount of storage space
and reserving space for the root user just reduces what the end user can
actually use.

I think analyzing performance of non-JFFS2 file systems and picking
a replacement should be a high-priority item for 9.1 update.

~Deepak

-- 
Deepak Saxena [EMAIL PROTECTED]


Re: NAND out of space crash

2008-07-21 Thread Jim Gettys
On Mon, 2008-07-21 at 09:51 -0700, Deepak Saxena wrote:
> On Jul 21 2008, at 13:39, C. Scott Ananian was caught saying:
> > > 2) JFFS2's behavior when the file system is almost full.  When it gets
> > > almost full, it can spend all its time trying to garbage collect, and
> > > you can lose completely (the system sort of gets the slows, and grinds
> > > to a halt).
> > >
> > > As to 2), there are patches done by Nokia (deployed on the N800 and
> > > similar devices) that reserve some extra space and report out of space
> > > before the system gets the slows.  These are in Dave's incoming queue
> > > to merge into JFFS2 the last I heard.  I don't know if he's merged them.
> >
> > These are less critical, IMO.  I have filled up NAND, and the slows
> > are not debilitating.  The issues above are. We should encourage Dave
> > to fix this issue and the other known JFFS2 bugs (trac #6480, for
> > instance)  -- or get dsaxena to do so -- for 9.1.
>
> #6480 is fixed as of yesterday, should be in next joyride.
>
> I'll be re-doing Nokia's patches so that they go upstream if we still want
> them after 8.2 is out; however, I don't think the approach used by them
> actually helps us.  We already have a very limited amount of storage space
> and reserving space for the root user just reduces what the end user can
> actually use.

IIRC, the issue is the GC runs more and more often the closer to full
you run.  By reserving some space, you avoid the performance cliff.

Since we expect to be running nearly full most of the time, it would
seem to me avoiding this cliff is important.

 
> I think analyzing performance of non-JFFS2 file systems and picking
> a replacement should be a high-priority item for 9.1 update.

No argument here
  - Jim

 
> ~Deepak
 
-- 
Jim Gettys [EMAIL PROTECTED]
One Laptop Per Child



Re: NAND out of space crash

2008-07-21 Thread Tomeu Vizoso
Hi, agreed on the action items, not so sure about the roadmap.

On Mon, Jul 21, 2008 at 7:39 PM, C. Scott Ananian [EMAIL PROTECTED] wrote:
>  d) trying to save new content to the journal should also give an
> obvious message that the NAND is full.

Should the DS also reserve some free space?

Regards,

Tomeu


Re: NAND out of space crash

2008-07-21 Thread David Woodhouse
On Mon, 2008-07-21 at 09:51 -0700, Deepak Saxena wrote:
> On Jul 21 2008, at 13:39, C. Scott Ananian was caught saying:
> > > 2) JFFS2's behavior when the file system is almost full.  When it gets
> > > almost full, it can spend all its time trying to garbage collect, and
> > > you can lose completely (the system sort of gets the slows, and grinds
> > > to a halt).
> > >
> > > As to 2), there are patches done by Nokia (deployed on the N800 and
> > > similar devices) that reserve some extra space and report out of space
> > > before the system gets the slows.  These are in Dave's incoming queue
> > > to merge into JFFS2 the last I heard.  I don't know if he's merged them.
> >
> > These are less critical, IMO.  I have filled up NAND, and the slows
> > are not debilitating.  The issues above are. We should encourage Dave
> > to fix this issue and the other known JFFS2 bugs (trac #6480, for
> > instance)  -- or get dsaxena to do so -- for 9.1.
>
> #6480 is fixed as of yesterday, should be in next joyride.

Yeah. Since it was purely cosmetic I figured it might as well just wait
to come through 'naturally'.

> I'll be re-doing Nokia's patches so that they go upstream if we still want
> them after 8.2 is out; however, I don't think the approach used by them
> actually helps us.  We already have a very limited amount of storage space
> and reserving space for the root user just reduces what the end user can
> actually use.
>
> I think analyzing performance of non-JFFS2 file systems and picking
> a replacement should be a high-priority item for 9.1 update.

I'm looking at making btrfs work on pure flash. It looks fairly sane in
that respect. Using a 'standard' file system will have benefits...


-- 
dwmw2



Re: NAND out of space crash

2008-07-21 Thread C. Scott Ananian
On Mon, Jul 21, 2008 at 1:55 PM, David Woodhouse [EMAIL PROTECTED] wrote:
> > #6480 is fixed as of yesterday, should be in next joyride.
>
> Yeah. Since it was purely cosmetic I figured it might as well just wait
> to come through 'naturally'.

It's not purely cosmetic: in my testing the bogus accounting affects
the output of 'df', so that sugar thinks there is space available,
even though writes will all fail due to insufficient space.  I should
have noted this more clearly in the bug.
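
Because the reported free space can be wrong, a check that trusts `df` alone may pass while real writes fail. One hedged alternative is to attempt a small write and see whether it succeeds. In the sketch below `can_write` is a hypothetical helper and the probe file name is arbitrary. A one-byte probe can still succeed where a larger write would fail, so this is a lower bound on trouble, not a guarantee of space.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Sugar or olpc-utils.

# can_write DIR: succeed only if a small file can actually be created and
# written in DIR, regardless of what df reports.
can_write() {
    probe="$1/.space_probe.$$"
    # Group the command so that a failed open of the probe file is silenced too.
    if { printf 'x' > "$probe"; } 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    # Clean up a possibly half-created probe, then report failure.
    rm -f "$probe" 2>/dev/null
    return 1
}

# Example: fall back to a warning instead of trusting the free-space figure.
if ! can_write "${HOME:-/tmp}"; then
    echo "cannot write to ${HOME:-/tmp}: storage may be full" >&2
fi
```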
  --scott

-- 
 ( http://cscott.net/ )


Re: NAND out of space crash

2008-07-21 Thread C. Scott Ananian
On Mon, Jul 21, 2008 at 1:39 PM, C. Scott Ananian [EMAIL PROTECTED] wrote:
> A number of independent issues here:

I have edited http://dev.laptop.org/ticket/7125 to clarify the pieces
of this bug and to make the component tasks (including #5317) more
obvious.  I have *not* attempted to set milestones or priorities;
that's up to Greg/Michael/the component authors.  Clearly some of
these items are more critical than others; I agree with Deepak and
dwmw2 in that it might be easier/better to fix the root allocation
issue in 9.1 by simply moving to a better filesystem, since the
slows are not the critical item for this bug.

Anyway, please continue the discussion in trac for #7125 and its children.
 --scott

-- 
 ( http://cscott.net/ )


Re: NAND out of space crash

2008-07-21 Thread Deepak Saxena
On Jul 21 2008, at 13:55, Jim Gettys was caught saying:
> > #6480 is fixed as of yesterday, should be in next joyride.
> >
> > I'll be re-doing Nokia's patches so that they go upstream if we still want
> > them after 8.2 is out; however, I don't think the approach used by them
> > actually helps us.  We already have a very limited amount of storage space
> > and reserving space for the root user just reduces what the end user can
> > actually use.
>
> IIRC, the issue is the GC runs more and more often the closer to full
> you run.  By reserving some space, you avoid the performance cliff.
>
> Since we expect to be running nearly full most of the time, it would
> seem to me avoiding this cliff is important.

I can go ahead and apply the existing Nokia patch into the 8.2 kernel as
a short-term measure but don't want to arbitrarily choose a reservation size.
Dave, do you have a suggestion as to what percentage should be reserved to 
keep the GC from going out of control? If not, we'll need to run some
performance tests to find the sweet spot.

~Deepak

-- 
Deepak Saxena [EMAIL PROTECTED]


Re: NAND out of space crash

2008-07-21 Thread Erik Garrison
On Mon, Jul 21, 2008 at 01:39:25PM -0400, C. Scott Ananian wrote:
> On Mon, Jul 21, 2008 at 12:52 PM, Jim Gettys [EMAIL PROTECTED] wrote:
> > There are two issues here that we should be sure to not intertwingle:
> >
> > 1) whatever behavior Sugar may have when low/out of space, during
> > operation, or at boot time.
>
> A number of independent issues here:
>  ...
>  b) sugar should, ideally, start even if flash is full.   It is
> currently failing when writing to ~olpc/.boot_time or some such, and
> crashing.

In olpc-utils: usr/bin/olpc-session.  This was done for performance
testing work, and I am unaware of other references to the file.  We can
either comment out this stanza or remove it.  I have attached patches to
do either.

Erik
From 3527ba05f79f2a6543baa004a8b6fbf613dcd735 Mon Sep 17 00:00:00 2001
From: Erik Garrison [EMAIL PROTECTED]
Date: Mon, 21 Jul 2008 14:25:23 -0400
Subject: [PATCH] Stop writing ~/.boot_time at startup so we can improve our chances in NAND-fillup land.

---
 usr/bin/olpc-session |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/usr/bin/olpc-session b/usr/bin/olpc-session
index c50b5f1..a38bd4b 100755
--- a/usr/bin/olpc-session
+++ b/usr/bin/olpc-session
@@ -60,9 +60,10 @@ xset -r 9 -r 220  -r 67 -r 68 -r 69 -r 70 -r 71 -r 72 -r 73 -r 74 -r 79 -r \
 # source custom user session, if present
 [ -f $HOME/.xsession ] && . $HOME/.xsession
 
-# useful for performance work
-mv $HOME/.boot_time $HOME/.boot_time.prev 2>/dev/null
-cat /proc/uptime > $HOME/.boot_time
+# Uncomment the following lines to save a record of our startup time.
+# This is useful for performance work.
+# mv $HOME/.boot_time $HOME/.boot_time.prev 2>/dev/null
+# cat /proc/uptime > $HOME/.boot_time
 
 # finally, run sugar
 exec /usr/bin/ck-xinit-session /usr/bin/sugar
-- 
1.5.4.3

From 66cbebe1338dd9167d49b69cb71b4911676bb013 Mon Sep 17 00:00:00 2001
From: Erik Garrison [EMAIL PROTECTED]
Date: Mon, 21 Jul 2008 14:21:06 -0400
Subject: [PATCH] Stop writing ~/.boot_time at startup so we can improve our chances in NAND-fillup land.

---
 usr/bin/olpc-session |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/usr/bin/olpc-session b/usr/bin/olpc-session
index c50b5f1..4a82845 100755
--- a/usr/bin/olpc-session
+++ b/usr/bin/olpc-session
@@ -60,9 +60,5 @@ xset -r 9 -r 220  -r 67 -r 68 -r 69 -r 70 -r 71 -r 72 -r 73 -r 74 -r 79 -r \
 # source custom user session, if present
 [ -f $HOME/.xsession ] && . $HOME/.xsession
 
-# useful for performance work
-mv $HOME/.boot_time $HOME/.boot_time.prev 2>/dev/null
-cat /proc/uptime > $HOME/.boot_time
-
 # finally, run sugar
 exec /usr/bin/ck-xinit-session /usr/bin/sugar
-- 
1.5.4.3



Re: NAND out of space crash

2008-07-21 Thread C. Scott Ananian
On Mon, Jul 21, 2008 at 2:31 PM, Erik Garrison [EMAIL PROTECTED] wrote:
> >  b) sugar should, ideally, start even if flash is full.   It is
> > currently failing when writing to ~olpc/.boot_time or some such, and
> > crashing.
>
> In olpc-utils: usr/bin/olpc-session.  This was done for performance
> testing work, and I am unaware of other references to the file.  We can
> either comment out this stanza or remove it.  I have attached patches to
> do either.

Erik, would you mind claiming #7586 and/or #7587?  I don't think we
need to remove the boot time code; we just need to make sure that the
shell script doesn't exit if it fails.
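
One way to keep the boot-time record without letting it abort the session is to make both commands non-fatal. The sketch below assumes the session script stops on a failing command (for example under `set -e`). `record_boot_time` is a hypothetical wrapper for illustration, not the code in olpc-session.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration; not the actual olpc-session code.

# record_boot_time DIR: save /proc/uptime to DIR/.boot_time, keeping the
# previous record as DIR/.boot_time.prev. Always returns 0, so a full NAND
# cannot stop the caller from going on to start the session.
record_boot_time() {
    dir=$1
    # Rotation fails harmlessly when there is no previous record.
    mv "$dir/.boot_time" "$dir/.boot_time.prev" 2>/dev/null || true
    # Group the command so a failed open (e.g. no space left) is silenced too.
    { cat /proc/uptime > "$dir/.boot_time"; } 2>/dev/null || true
    return 0
}

# Example call, in the place where olpc-session writes ~/.boot_time today.
record_boot_time "${HOME:-/tmp}"
```

Wrapping the two commands in one function keeps the "never fatal" rule in a single place rather than relying on every line remembering its own `|| true`.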
 --scott

-- 
 ( http://cscott.net/ )


Re: NAND out of space crash

2008-07-21 Thread John Watlington

It sounds like you are working on the root causes.
Today I'm hanging out with the logistics/repair team,
and the problem is worse than I thought this morning.
They are being inundated with new problems caused
by a full disk (but weren't really aware that was the cause.)

Since fixes in 8.2 won't help them for months, they need
the short term fix (c).   I will talk to Fiorella and her team
about progress on that tomorrow.

They also need a way of repairing these in the field.
Mailing them back to LATU for reflashing is costing a fortune.
Over 55% of their returns for repair are fixed by
reflashing/reactivating.

There are two problems with having teachers reflash them:
1) The teachers don't have activation keys for the machines,
   and Uruguay doesn't want to start giving them out.
2) Currently, there is no monolithic image for Uruguay
   (I was unaware of this, but they say that first they reflash, then
   they activate, then they install the Uruguay specific scripts.)

It seems like we should be able to produce an upgrade and
customize key that does this in one step, and preserves the
activation key for the laptop.

Thoughts ?
wad




Re: NAND out of space crash

2008-07-21 Thread C. Scott Ananian
On Mon, Jul 21, 2008 at 3:57 PM, John Watlington [EMAIL PROTECTED] wrote:
> It seems like we should be able to produce an upgrade and
> customize key that does this in one step, and preserves the
> activation key for the laptop.

Yes.  The issues in the past have just been coordination-related.  I
believe Emiliano is capable of generating a build image with the
Uruguay scripts installed, which is the first half of the problem.
 --scott

-- 
 ( http://cscott.net/ )


Re: NAND out of space crash

2008-07-21 Thread John Gilmore
> They are being inundated with new problems caused
> by full disk (but weren't really aware that was the cause.)
>
> Since fixes in 8.2 won't help them for months, they need
> the short term fix (c).

Mitch added Forth words to delete files from the NAND flash, after
we had similar troubles after Christmas (bug #5744, #5719, #5317):
  
  Changed 7 months ago by [EMAIL PROTECTED]

  OFW q2d07c and later have the ability to delete files from the JFFS2
  filesystem, so long as there is at least one empty page for storing
  the deletion node.

ok dir n:\home\olpc\.sugar\default\data\
ok rm n:\home\olpc\.sugar\default\data\XXX

  where XXX is the name of the file you want to delete.

[I don't know how often there will be no empty page for the deletion
node - I suspect we'll find out.]

I suggest that OLPC figure out a short list of reasonably large files
that we supply on NAND, but which aren't actually needed by most
students (perhaps a language translation for a language they don't use;
or an activity binary that they can easily reinstall later).  Include
that list along with instructions on how to remove one or more of these
files when they get into this jam.
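
Building such a list starts with finding the largest files in a candidate directory. A minimal sketch, run on a working system rather than at the Forth prompt, is below. `largest_files` is a hypothetical helper and the example path is only an assumption about where optional content might live. It ranks by byte length, which under JFFS2 compression is not the same as space freed on flash.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not an OLPC tool.

# largest_files DIR N: print the N largest regular files under DIR,
# one per line as "SIZE_IN_BYTES PATH", largest first.
# Paths containing newlines are not handled.
largest_files() {
    find "$1" -type f -exec sh -c '
        for f do
            # wc -c may pad its output with spaces; strip them.
            printf "%s %s\n" "$(wc -c < "$f" | tr -d " ")" "$f"
        done' sh {} + | sort -rn | head -n "$2"
}

# Example: the ten largest files under an assumed activities directory.
# The path is a placeholder; substitute whatever tree is being surveyed.
if [ -d /usr/share/activities ]; then
    largest_files /usr/share/activities 10
fi
```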

Of course, getting to Forth requires a normal computer (i.e. a
developer key, which every child is entitled to, but apparently no
children actually get).  You can get developer keys, even from a
crashed XO that won't boot NAND, using a collector key, web access,
and a lot of patience.

Somebody who had the sooper secret OLPC script-signing key could write
a Forth script that field teachers could run on crashed lockdown XO's,
which would put them into Forth and let them type.  (Perhaps if you
believe deeply in making security expensive, it can check to see if
the NAND is more than 95% full, and only let them type if so.  Or it
can provide a menu of files for deletion.  Or it can limit itself in
any number of ways, making it less useful but more quote-unquote safe)

John



Re: NAND out of space crash

2008-07-21 Thread John Gilmore
I should've said that just removing a couple of useless or easily
replaced files -- rather than reflashing -- means that the kids don't
lose all their work when the NAND fills up.

John



Re: NAND out of space crash

2008-07-21 Thread David Woodhouse
On Mon, 2008-07-21 at 10:29 -0700, Deepak Saxena wrote:
> I can go ahead and apply the existing Nokia patch into the 8.2 kernel as
> a short-term measure but don't want to arbitrarily choose a reservation size.
> Dave, do you have a suggestion as to what percentage should be reserved to
> keep the GC from going out of control? If not, we'll need to run some
> performance tests to find the sweet spot.

I don't have a suggestion. But I'd prefer not to apply the overly
complex patch from Artem -- just add a 'root only' threshold and
hard-code it for now (we should really expose _all_ the thresholds in
sysfs).
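
The "root only" threshold can be illustrated outside the kernel as a simple admission rule. This is a shell sketch of the policy only, not JFFS2 code. `may_allocate` and its arguments are hypothetical, and the reserve size is exactly the number the thread has not settled.

```shell
#!/bin/sh
# Illustration of a "root only" reserve policy; not kernel or JFFS2 code.

# may_allocate FREE_KB REQUEST_KB RESERVE_KB USER_ID
# Succeeds when the request should be allowed:
#   - nobody may allocate more than is actually free;
#   - root (user id 0) may use the reserved space;
#   - everyone else must leave RESERVE_KB untouched.
may_allocate() {
    free=$1
    request=$2
    reserve=$3
    user_id=$4
    [ "$request" -le "$free" ] || return 1
    [ "$user_id" -eq 0 ] && return 0
    [ $((free - request)) -ge "$reserve" ]
}

# Example with made-up numbers: 1000 KB free, 500 KB reserved for root.
if may_allocate 1000 600 500 "$(id -u)"; then
    echo "write allowed"
else
    echo "write refused: would eat into the reserve"
fi
```

With a hard-coded reserve, an ordinary user sees "out of space" early while root can still delete or compact, which is the behaviour the single threshold is meant to provide.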

-- 
dwmw2



Re: NAND out of space crash (was Display warnings in sugar (Emiliano Pastorino))

2008-07-19 Thread Greg Smith
Hi All,

Emiliano has an elegant workaround but crashing the XO on NAND full (to 
un-recoverable state?) is a heinous bug that affects essentially all users.

If someone has the bug ID handy can you send it out and mark it a 
blocker for 8.2.0 (priority = blocker and keyword includes blocks:8.2.0)?

Can I get a design proposal (no re-partitioning please!), scoping and 
lead engineer on it ASAP?

If you have to stop working on something else to do this, let me know 
what will drop and I'll help weigh the consequences.

Thanks,

Greg S

[EMAIL PROTECTED] wrote:
> Date: Thu, 17 Jul 2008 15:44:56 -0400
> From: C. Scott Ananian [EMAIL PROTECTED]
> Subject: Re: [sugar] Display warnings in sugar
> To: Tomeu Vizoso [EMAIL PROTECTED]
> Cc: devel@lists.laptop.org, Eben Eliason [EMAIL PROTECTED],
>   [EMAIL PROTECTED]
> Message-ID:
>   [EMAIL PROTECTED]
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Thu, Jul 17, 2008 at 5:21 AM, Tomeu Vizoso [EMAIL PROTECTED] wrote:
> > On Thu, Jul 17, 2008 at 2:27 AM, C. Scott Ananian [EMAIL PROTECTED] wrote:
> > > I hope our alert system will use the freedesktop.org standard:
> > >  http://www.galago-project.org/specs/notification/index.php
> > > It is widely used in Gnome, and when I last reviewed it seems to be a
> > > solid and capable spec.
> > The interfaces in that spec look quite good, although perhaps would
> > benefit from a simpler, alternative API that also abstracts the D-Bus
> > stuff. Perhaps rainbow should do some rate limiting or permissions
> > checking, not sure.
>
> Sure, wrap the actual DBus calls with a simplified sugar/python method
> if you like, but *please* let's implement a listener for that API so
> that unmodified applications can interact sensibly with Sugar, and so
> that our system tools & activities can interoperate with non-Sugar
> window managers.
>
> Similarly, we should really implement that standard freedesktop.org
> startup notification spec, so we can get sensible notifications and
> icons for 'ordinary' applications.
>  --scott
 


Re: NAND out of space crash (was Display warnings in sugar (Emiliano Pastorino))

2008-07-19 Thread Erik Garrison
On Sat, Jul 19, 2008 at 11:47:21AM -0400, Greg Smith wrote:
> Hi All,
>
> Emiliano has an elegant workaround but crashing the XO on NAND full (to
> un-recoverable state?) is a heinous bug that affects essentially all users.
>
> If someone has the bug ID handy can you send it out and mark it a
> blocker for 8.2.0 (priority = blocker and keyword includes blocks:8.2.0)?
>
> Can I get a design proposal (no re-partitioning please!), scoping and
> lead engineer on it ASAP?
>
> If you have to stop working on something else to do this, let me know
> what will drop and I'll help weigh the consequences.

My impression is that the long-term benefits of partitioning mean that
it's worthwhile to devote effort to it.  Are we not going to work on
partitioning in the future?

In addition to a more solid solution to the NAND fillup issue, we get
the opportunity to improve system performance and upgrade procedures.
Partitioning will allow us to test out LZO data compression for the XO's
filesystems (excluding /boot and /security).  We would expect a
significant i/o performance boost from the use of LZO.  Additionally,
partitioning would improve OFW-level system updates (e.g. copy-nand) by
making it far simpler for the update procedure to leave user data
intact.

That said there are obviously a lot of troubles with partitioning.
Updating an existing system to a partitioned one without mashing user
data is a major issue.

Erik


Re: NAND out of space crash (was Display warnings in sugar (Emiliano Pastorino))

2008-07-19 Thread Erik Garrison
On Sat, Jul 19, 2008 at 12:58:13PM -0400, Benjamin M. Schwartz wrote:

 Erik Garrison wrote:
 | On Sat, Jul 19, 2008 at 11:47:21AM -0400, Greg Smith wrote:
 | Hi All,
 |
 | Emiliano has an elegant workaround but crashing the XO on NAND full (to
 | un-recoverable state?) is a heinous bug that affects essentially all users.
 |
 | If someone has the bug ID handy can you send it out and mark it a
 | blocker for 8.2.0 (priority = blocker and keyword includes blocks:8.2.0)?
 |
 | Can I get a design proposal (no re-partitioning please!), scoping and
 | lead engineer on it ASAP?
 |
 | If you have to stop working on something else to do this, let me know
 | what will drop and I'll help weigh the consequences.
 |
 | My impression is that the long-term benefits of partitioning mean that
 | it's worthwhile to devote effort to it.  Are we not going to work on
 | partitioning in the future?

 Adding partitioning does not automatically solve the NAND fillup problem.
 The fundamental issue is that Sugar tries to write files on boot, and
 fails to boot if it cannot write those files.

 The correct solution is to make sure that Sugar can boot even if it cannot
 write files.  This change is needed in order to enable booting on full
 NAND, whether or not partitioning is used to separate system and user
 files.  In short, these issues, while related, are largely decoupled, and
 can be attacked separately.

You are absolutely correct.

Partitioning can be used to isolate the system filesystem(s) from the
effects of user-level data creation, and thus reduce the risk that a
full partition yields an unbootable system.  However, it does nothing
for the fillup issue until we ensure that Sugar only needs to write to a
partition which we are confident will have space.  If we are going to
audit all the file write requirements of the Sugar shell, we might as
well implement the far better solution of enabling Sugar to boot without
writing anything.
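
As a rough illustration of the "confident will have space" condition, here is a minimal Python 3 sketch of asking a filesystem how much room it reports before attempting a write. The function name has_free_space and the threshold parameter are hypothetical and do not come from the Sugar source. The reported figure may only be an estimate on a compressed flash filesystem, so this is a hint rather than a guarantee.

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` reports at least
    `min_free_bytes` free; return False if it reports less or if the
    query itself fails.

    Hypothetical helper for illustration only; not part of Sugar.
    """
    try:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
    except OSError:
        # Path missing or filesystem not queryable: treat as "no space".
        return False
    return usage.free >= min_free_bytes
```

A check like this can only reduce the chance of a failed write; the write itself still has to tolerate failure, because space can run out between the check and the write.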

Below is a patch to Sugar which handles the only Python-side case of a
file write during startup which I was able to find.

I couldn't find any reference to the configuration variables saved in
_save_session_info elsewhere in the sugar repository.  If these
variables are read back from the config file after Sugar startup, then
this patch is a bad idea on its own.



diff --git a/src/main.py b/src/main.py
index b1ecc93..1899438 100644
--- a/src/main.py
+++ b/src/main.py
@@ -55,15 +55,19 @@ def _save_session_info():
     #do not rely on it
     #
     session_info_file = os.path.join(env.get_profile_path(), "session.info")
-    f = open(session_info_file, "w")
+    try:
+        f = open(session_info_file, "w")
+
+        cp = ConfigParser()
+        cp.add_section('Session')
+        cp.set('Session', 'dbus_address', os.environ['DBUS_SESSION_BUS_ADDRESS'])
+        cp.set('Session', 'display', gtk.gdk.display_get_default().get_name())
+        cp.write(f)
 
-    cp = ConfigParser()
-    cp.add_section('Session')
-    cp.set('Session', 'dbus_address', os.environ['DBUS_SESSION_BUS_ADDRESS'])
-    cp.set('Session', 'display', gtk.gdk.display_get_default().get_name())
-    cp.write(f)
+        f.close()
+    except IOError, (errno, sterror):
+        logger.error("Could not open session_info_file. %s" % sterror)
 
-    f.close()
 
 def _setup_translations():
     locale_path = os.path.join(config.prefix, 'share', 'locale')
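
For illustration, here is a minimal sketch of the same idea in present-day Python 3 (the patch above targets Python 2): attempt the write, log any failure, and let startup continue. The function name save_session_info, its parameters, and the logger name are assumptions made for this example; the D-Bus address and display name are passed in so that the sketch runs without GTK or a session bus. It is not the code that went into Sugar.

```python
import configparser
import logging
import os

logger = logging.getLogger("session-info-sketch")  # hypothetical logger name


def save_session_info(profile_path, dbus_address, display_name):
    """Try to write session.info under profile_path.

    Returns True on success and False if the file could not be written
    (for example because the filesystem is full). Never raises on I/O
    failure, so a caller can keep booting either way.
    """
    session_info_file = os.path.join(profile_path, "session.info")

    # interpolation=None so that '%' characters in a value are stored as-is.
    cp = configparser.ConfigParser(interpolation=None)
    cp.add_section("Session")
    cp.set("Session", "dbus_address", dbus_address)
    cp.set("Session", "display", display_name)

    try:
        # The with-block also covers errors raised when the file is
        # flushed and closed, not only errors from open().
        with open(session_info_file, "w") as f:
            cp.write(f)
    except OSError as err:
        logger.error("Could not write %s: %s", session_info_file, err)
        return False
    return True
```

Returning a status rather than raising leaves the caller free to decide whether a missing session.info matters, which is the open question raised about whether anything reads these values back after startup.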


Re: NAND out of space crash (was Display warnings in sugar (Emiliano Pastorino))

2008-07-19 Thread Erik Garrison
Disclaimer: the attached patch is untested and likely insufficient to
solve this problem.

On Sat, Jul 19, 2008 at 01:39:20PM -0400, Erik Garrison wrote:
 On Sat, Jul 19, 2008 at 12:58:13PM -0400, Benjamin M. Schwartz wrote:
 
  Erik Garrison wrote:
  | On Sat, Jul 19, 2008 at 11:47:21AM -0400, Greg Smith wrote:
  | Hi All,
  |
  | Emiliano has an elegant workaround but crashing the XO on NAND full (to
  | un-recoverable state?) is a heinous bug that affects essentially all users.
  |
  | If someone has the bug ID handy can you send it out and mark it a
  | blocker for 8.2.0 (priority = blocker and keyword includes blocks:8.2.0)?
  |
  | Can I get a design proposal (no re-partitioning please!), scoping and
  | lead engineer on it ASAP?
  |
  | If you have to stop working on something else to do this, let me know
  | what will drop and I'll help weigh the consequences.
  |
  | My impression is that the long-term benefits of partitioning mean that
  | it's worthwhile to devote effort to it.  Are we not going to work on
  | partitioning in the future?
 
  Adding partitioning does not automatically solve the NAND fillup problem.
  The fundamental issue is that Sugar tries to write files on boot, and
  fails to boot if it cannot write those files.
 
  The correct solution is to make sure that Sugar can boot even if it cannot
  write files.  This change is needed in order to enable booting on full
  NAND, whether or not partitioning is used to separate system and user
  files.  In short, these issues, while related, are largely decoupled, and
  can be attacked separately.
 
 You are absolutely correct.
 
 Partitioning can be used to isolate the system filesystem(s) from the
 effects of user-level data creation, and thus reduce the risk that a
 full partition yields an unbootable system.  However, it does nothing
 for the fillup issue until we ensure that Sugar only needs to write to a
 partition which we are confident will have space.  If we are going to
 audit all the file write requirements of the Sugar shell, we might as
 well implement the far better solution of enabling Sugar to boot without
 writing anything.
 
 Below is a patch to Sugar which handles the only Python-side case of a
 file write during startup which I was able to find.
 
 I couldn't find any reference to the configuration variables saved in
 _save_session_info elsewhere in the sugar repository.  If these
 variables are read back from the config file after Sugar startup, then
 this patch is a bad idea on its own.
 
 
 
 diff --git a/src/main.py b/src/main.py
 index b1ecc93..1899438 100644
 --- a/src/main.py
 +++ b/src/main.py
 @@ -55,15 +55,19 @@ def _save_session_info():
      #do not rely on it
      #
      session_info_file = os.path.join(env.get_profile_path(), "session.info")
 -    f = open(session_info_file, "w")
 +    try:
 +        f = open(session_info_file, "w")
 +
 +        cp = ConfigParser()
 +        cp.add_section('Session')
 +        cp.set('Session', 'dbus_address', os.environ['DBUS_SESSION_BUS_ADDRESS'])
 +        cp.set('Session', 'display', gtk.gdk.display_get_default().get_name())
 +        cp.write(f)
  
 -    cp = ConfigParser()
 -    cp.add_section('Session')
 -    cp.set('Session', 'dbus_address', os.environ['DBUS_SESSION_BUS_ADDRESS'])
 -    cp.set('Session', 'display', gtk.gdk.display_get_default().get_name())
 -    cp.write(f)
 +        f.close()
 +    except IOError, (errno, sterror):
 +        logger.error("Could not open session_info_file. %s" % sterror)
  
 -    f.close()
  
  def _setup_translations():
      locale_path = os.path.join(config.prefix, 'share', 'locale')

