Re: Bottom up build

2014-01-31 Thread Andre Fischer

On 30.01.2014 23:10, Rob Weir wrote:

On Wed, Jan 29, 2014 at 4:18 AM, Andre Fischer awf@gmail.com wrote:

[...]


Has anyone read this book?

http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620

It has been on my list to read for many years.  From what I've seen it
suggests design approaches to improve build times: things that go
beyond what you can do by just changing build files, more fundamental
changes to how interfaces are defined.


I have not but I might, thanks for the hint.

I agree that improving our software design is always a good idea, but I
would not change our design just to make the build faster.  Many of our
code-related problems are caused by the design and the limitations of
the build system.  Examples are the individual building of directories
(in dmake modules) and the creation of one library per directory, or the
many ugly tricks that avoid becoming incompatible (the need to compile
files in other modules).




Otherwise I wonder if we're trying to optimize a bubble sort?


I would smile about this if I had not seen a handcrafted sort algorithm 
in our build scripts that 

Re: Bottom up build

2014-01-30 Thread Kay Schenk
On Wed, Jan 29, 2014 at 1:54 AM, jan i j...@apache.org wrote:

 On 29 January 2014 10:18, Andre Fischer awf@gmail.com wrote:

  [...]
 


It's really interesting to read these observations and test cases... we
have a large and complicated source tree and just seeing what can be
observed about it is fascinating to me.

Thanks for writing down your observations, which I find highly interesting.
 I hope that your stopping to write about the build does not mean that you
 will also stop giving your opinion on my ideas in the future.

 For the record, the capstone project and my little hobby project Build
 R.I.P. follow a third idea:

 We have a clear separation of module build and central (total) build.


+1, I would certainly go for this.

A while back someone asked about ye olde ld approach: all modules
compiled/built and then linked later down the road.  If we could somehow
get back to that idea in a more friendly, modern way, it would certainly
make working on specific areas more feasible.




 

Re: Bottom up build

2014-01-30 Thread Rob Weir
On Wed, Jan 29, 2014 at 4:18 AM, Andre Fischer awf@gmail.com wrote:
 [...]


Has anyone read this book?

http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620

It has been on my list to read for many years.  From what I've seen it
suggests design approaches to improve build times: things that go
beyond what you can do by just changing build files, more fundamental
changes to how interfaces are defined.

Otherwise I wonder if we're trying to optimize a bubble sort?

-Rob



-
To unsubscribe, e-mail: dev-unsubscr...@openoffice.apache.org
For additional commands, e-mail: dev-h...@openoffice.apache.org



Re: Bottom up build

2014-01-30 Thread jan i
On 30 January 2014 23:10, Rob Weir robw...@apache.org wrote:

 On Wed, Jan 29, 2014 at 4:18 AM, Andre Fischer awf@gmail.com wrote:
  [...]

 Has anyone read this book?

 http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620

 It has been on my list to read for many years.  From what I've seen it
 suggests design approaches to improve build times: things that go beyond
 what you can do by just changing build files, more fundamental changes
 to how interfaces are defined.


I have read it; the book goes more into C++ structures and design than
into the actual build process.

If you have a pure C++ project, you can gain a lot of speed by designing
the classes for speed instead of purity.

It's quite a good book, but it has very little to offer for the AOO build
system.



 Otherwise I wonder if we're trying to optimize a bubble sort?


No, we are trying to move away from 3-4 build components that try to do
the same thing, each sub-optimized.

In other words, our 

Bottom up build

2014-01-29 Thread Andre Fischer
I would like to report some observations that I made when thinking about
how to make building OpenOffice with one global makefile feasible.  It
will probably be the last of my build-related mails in the near future.


Traditional make uses a top-down approach.  It starts with a target,
'all' by default, and looks at its dependencies.  When one of those has
to be made or is newer than the target, then the target also has to be
made.  This is done recursively and depth-first.  Every file on which
'all' has a direct or indirect dependency has to be checked.  If we
built OpenOffice with one makefile (included makefiles don't count),
that would be a lot of files to check.  There are about 9800 cxx files,
3500 c files, 12000 hxx files, and a lot of external headers.  Checking
the modification time of so many files is one of the reasons for the
delay in, say, sw/ between starting make and its first action.
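
The top-down check can be sketched in a few lines of Java (a
hypothetical illustration with made-up names, not code from the build
system): deciding whether a single target is up to date forces a stat()
of every direct and indirect dependency.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the top-down approach: to decide whether a
// target must be remade, make compares its mtime against every direct
// and indirect dependency, so all of them have to be stat()ed first.
public class TopDown {
    public static boolean isOutOfDate(Path target, Map<Path, List<Path>> deps)
            throws IOException {
        if (!Files.exists(target)) {
            return true;                       // missing targets are always made
        }
        FileTime targetTime = Files.getLastModifiedTime(target);
        for (Path dep : deps.getOrDefault(target, List.of())) {
            // depth-first recursion over the whole dependency tree
            if (isOutOfDate(dep, deps)
                    || Files.getLastModifiedTime(dep).compareTo(targetTime) > 0) {
                return true;
            }
        }
        return false;
    }
}
```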


As I don't have all global dependencies in a format that would allow
experimentation, I measured how long it would take to get the mtime
(time of last modification) of all files, source, generated, compiled,
linked, about 12.  I wrote a small Java program for that.  With a warm
cache that takes about 23s.  When run in 4 threads this was reduced to
less than 8s.  Could be worse.
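
Such a scan can be sketched with a ForkJoinPool.  This is a
hypothetical reconstruction in the spirit of the experiment (class and
method names are my own invention; the actual source is at [3]):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical reconstruction of the parallel mtime scan: each
// directory becomes a fork/join task that stats its own files and
// recurses into subdirectories in parallel.
public class MtimeScan extends RecursiveTask<Long> {
    private final File dir;

    MtimeScan(File dir) { this.dir = dir; }

    @Override
    protected Long compute() {
        long statted = 0;
        List<MtimeScan> subtasks = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0L;                 // unreadable directory
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                MtimeScan task = new MtimeScan(entry);
                task.fork();           // scan the subdirectory in parallel
                subtasks.add(task);
            } else {
                entry.lastModified();  // the expensive stat() call
                statted++;
            }
        }
        for (MtimeScan task : subtasks) {
            statted += task.join();
        }
        return statted;
    }

    // Scan 'root' with the given number of worker threads and return
    // the number of files whose mtime was read.
    public static long scan(File root, int threads) {
        return new ForkJoinPool(threads).invoke(new MtimeScan(root));
    }
}
```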


But it also could be better, because in general only a few files are
modified, usually files that you modified yourself in an editor.  There
is another approach, introduced, as far as I know, by the tup [1] build
tool, that is bottom-up.  If you had something similar to the oracle of
complexity theory, which gave you the list of files modified since the
last build, you could find the dependent files much faster.  Faster for
two reasons.  Firstly, there is only one path in the dependency tree up
towards the root (while there are many down from the root).  Only
targets on this path are affected by the modified file.  Secondly, the
dependency analysis is comparatively cheap.  The expensive part is
determining the file modification times.  If they were miraculously
given, then even the top-down approach would not take noticeably
longer.
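
The gain of walking upwards is easy to see in code.  Assuming a reverse
dependency map (file -> targets that directly depend on it) is available
(a hypothetical data structure, names are illustrative), the affected
targets are found by starting from the modified files only:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the bottom-up direction: given a reverse
// dependency map and the set of modified files, walk up the dependency
// tree and collect only the affected targets; no other file is even
// looked at, let alone stat()ed.
public class BottomUp {
    public static Set<String> affectedTargets(Map<String, List<String>> dependents,
                                              Collection<String> modified) {
        Set<String> affected = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(modified);
        while (!work.isEmpty()) {
            String node = work.pop();
            for (String target : dependents.getOrDefault(node, List.of())) {
                if (affected.add(target)) {
                    work.push(target);  // propagate one level further up
                }
            }
        }
        return affected;
    }
}
```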


So I made another experiment to see whether such an oracle can be
created.  Java 7 has the java.nio.file.WatchService that lets you
monitor file modifications in one directory.  I registered it for all
directories in our source tree (some 16000 directories).  With the
WatchService in place, every file modification can be recorded and
stored for later.  On the next build you only have to check the set of
modified files, not all files.  Registering the directory watchers
takes a couple of seconds, but after that it does not cause any
noticeable CPU activity.  Any file modifications are reported almost at
once.  I do not have the framework in place to start a build with this
information, but I would expect it to be about as fast as compiling the
modified files and linking them takes.
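
A minimal sketch of such an oracle with the Java 7 API (hypothetical
class and method names; the real experiment is in [3]):

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the modification oracle: register a
// WatchService on every directory of the tree once, then collect the
// set of files modified since the last build.
public class ModifiedFilesOracle {
    private final WatchService watcher;
    private final Set<Path> modified = new HashSet<>();

    public ModifiedFilesOracle(Path root) throws IOException {
        watcher = FileSystems.getDefault().newWatchService();
        // Walk the tree once and register every directory; for the
        // ~16000 directories of the source tree this is the part that
        // takes a couple of seconds.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                dir.register(watcher,
                        StandardWatchEventKinds.ENTRY_CREATE,
                        StandardWatchEventKinds.ENTRY_MODIFY,
                        StandardWatchEventKinds.ENTRY_DELETE);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Drain all events reported so far; the next build only has to
    // look at this set instead of stat()ing every file.
    public Set<Path> drainModified() {
        WatchKey key;
        while ((key = watcher.poll()) != null) {
            Path dir = (Path) key.watchable();
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() != StandardWatchEventKinds.OVERFLOW) {
                    modified.add(dir.resolve((Path) event.context()));
                }
            }
            key.reset();
        }
        return modified;
    }
}
```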


The tup website references a paper [2] in which the established
top-down approaches are called alpha algorithms and the new bottom-up
approach is called the beta algorithm.  Tup has implemented a file
modification watcher (in C or C++) only for Linux.  On Windows it just
scans all files (for which it needs a little more time than my Java
program; maybe it does not use more than one thread).



This is something that we should keep in mind for when we eventually
get a build solution with global dependencies and that build tool turns
out to be too slow.



You can find the source code of my Java experiments at [3].  If
nothing else, you can see an application of the ForkJoinPool that
allowed me to write the parallel file system scan in just a few lines.
There is also an alternative implementation that uses the
ExecutorService (with a fixed thread pool), which needs a few more
lines of code.  And there is of course the use of the WatchService.



Regards,
Andre


[1] http://gittup.org/tup/
[2] http://gittup.org/tup/build_system_rules_and_algorithms.pdf
[3] http://people.apache.org/~af/test.zip




Re: Bottom up build

2014-01-29 Thread jan i
On 29 January 2014 10:18, Andre Fischer awf@gmail.com wrote:

 [...]


Thanks for writing down your observations, which I find highly interesting.
I hope that your stopping to write about the build does not mean that you
will also stop giving your opinion on my ideas in the future.

For the record, the capstone project and my little hobby project Build
R.I.P. follow a third idea:

We have a clear separation of module build and central (total) build.  The
module makefile knows how to build the module, and the central makefile
knows the relation between modules.

The makefile in each module touches a file, and the central makefile only
tracks that file.

But your idea of watching for changes is very interesting.

rgds
jan I.

