Re: [MTT users] Corrupted MTT database or incorrect query

2006-11-13 Thread Ethan Mallove
On Mon, Nov/13/2006 10:56:06AM, Josh Hursey wrote:
> 
> On Nov 13, 2006, at 10:27 AM, Ethan Mallove wrote:
> 
> >I can infer that you have an MPI Install section labeled
> >"odin 64 bit gcc". A few questions:
> >
> >* What is the mpi_get for that section (or how does that
> >  parameter get filled in by your automated scripts)?
> 
> I attached the generated INI file for you to look at.


> 
> It is the same value for all parallel runs of GCC+64bit (same value  
> for all branches)
> 
> 
> >* Do you start with a fresh scratch tree every run?
> 
> Yep. Every run, and all of the parallel runs.
> 
> >* Could you email me your scratch/installs/mpi_installs.xml
> >  files?
> 

> 
>   
> 
>   append_path=""
>
> bindir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install/bin"
>c_bindings="1"
>compiler_name="gnu"
>compiler_version="3.4.6"
>configure_arguments="FCFLAGS=-m64 FFLAGS=-m64 CFLAGS=-m64 
> CXXFLAGS=-m64 --with-wrapper-cflags=-m64 --with-wrapper-cxxflags=-m64 
> --with-wrapper-fflags=-m64 --with-wrapper-fcflags=-m64"
>cxx_bindings="1"
>f77_bindings="1"
>f90_bindings="1"
>full_section_name="mpi install: odin 64 bit gcc"
>
> installdir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install"
>
> libdir="/san/homedirs/mpiteam/mtt-runs/odin/20061112-Nightly/parallel-block-1/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install/lib"
>merge_stdout_stderr="1"
>mpi_details="Open MPI"
>mpi_get_full_section_name="mpi get: ompi-nightly-trunk"
>mpi_get_simple_section_name="ompi-nightly-trunk"
>mpi_version="1.3a1r12559"
>prepend_path=""
>result_message="Success"
>setenv=""
>success="1"
>test_status="installed"
>timestamp="1163316821"
>unsetenv=""
>vpath_mode="none" />
> 
>   
> 

> The attached mpi_installs.xml is from the trunk+gcc+64bit parallel  
> scratch directory.
> 
> >
> >I checked on how widespread this issue is, and found that
> >18,700 out of 474,000 Test Run rows in the past month have a
> >mpi_version/command mismatch (v1.2 vs. trunk), occurring in both
> >directions (version=1.2 with a trunk command and vice versa).
> >They occur on these clusters:
> >
> > Cisco MPI development cluster
> > IU Odin
> > IU - Thor - TESTING
> >
> 
> Interesting...
> 
> >There *is* that race condition in which one MTT submission
> >could overwrite another's index. Do you have "trunk" and
> >"1.2" runs submitting to the database at the same time?
> 
> Yes we do. :(
> 
> The parallel blocks, as we call them, are separate scratch
> directories in which MTT runs concurrently: we have N parallel-block
> scratch directories, each running one instance of MTT. So it is
> possible (and highly likely) that when the reporter phase fires, all
> N parallel blocks are firing it at about the same time.
>

Likely because the MTT runs start at the same time? Or because you
run the [Reporter:IU database] section for trunk/v1.2 at the same time?

> Without knowing how the reporter is doing the inserts into the
> database, I don't think I can help much more than that with
> debugging. When the reporter fires for the DB:
>  - Does it start a transaction for the connection, do the inserts,
> then commit?
>  - Does it ship the inserts to the server and let it run them, or
> does the client do all of the individual inserts?
>

lib/MTT/Reporter/MTTDatabase.pm HTTP POSTs the results to
server/php/submit/index.php.  index.php iterates over all of
the results and INSERTs them one at a time, but for each
result it checks whether that MPI Install (hardware, OS,
mpi_version, ...) is already in the database. If it is, it
reuses the existing row; otherwise it creates a new row (and
the problem is that the SELECT/INSERT on that index is not
atomic).
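
To illustrate the race, here is a rough sketch of that
check-then-insert pattern in plain SQL. The table and column names
below are made up for illustration only; the real MTT schema and
submit code differ in the details.

    -- Hypothetical table, for illustration only.
    CREATE TABLE mpi_install (
        mpi_install_id  integer,
        platform_id     text,
        mpi_version     text
    );

    -- Step 1: look for an existing MPI Install row.
    SELECT mpi_install_id
      FROM mpi_install
     WHERE platform_id = 'IU Odin'
       AND mpi_version = '1.3a1r12559';

    -- Step 2: if nothing was found, compute the next index and
    -- insert it.  Two submissions running these two steps at the
    -- same time can both see "no row" (or both compute the same
    -- next index), so one install ends up sharing or overwriting
    -- the other's index.
    INSERT INTO mpi_install (mpi_install_id, platform_id, mpi_version)
    VALUES ((SELECT COALESCE(MAX(mpi_install_id), 0) + 1 FROM mpi_install),
            'IU Odin', '1.3a1r12559');

Any Test Run rows submitted against a shared index can then be
attributed to the wrong MPI version, which is consistent with the
v1.2/trunk mismatches above.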

I'm having a tough time tracking down the race condition in
the Postgres logs, so I'm going to change that index to a
serial type now and see if the problem goes away.
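
Roughly the kind of change I mean, again with made-up table and
column names; with a serial column, Postgres assigns the index from
a sequence, so two concurrent submissions can never be handed the
same value:

    -- Hypothetical table, for illustration only: let Postgres
    -- assign the index instead of computing it in the submit code.
    CREATE TABLE mpi_install (
        mpi_install_id  serial PRIMARY KEY,
        platform_id     text,
        mpi_version     text
    );

    -- Each INSERT gets its own id, even under concurrent submits.
    INSERT INTO mpi_install (platform_id, mpi_version)
    VALUES ('IU Odin', '1.3a1r12559');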


> -- Josh
> 
> >
> >
> >On Sun, Nov/12/2006 06:04:17PM, Jeff Squyres (jsquyres) wrote:
> >>
> >>   I feel somewhat better now.  Ethan - can you fix?
> >>-Original Message-
> >>   From:   Tim Mattox [[1]mailto:timat...@open-mpi.org]
> >>   Sent:   Sunday, November 12, 2006 05:34 PM Eastern Standard Time
> >>   To: General user list for the MPI Testing Tool
> >>   Subject:[MTT users] Corrupted MTT database or
> >>incorrect query

Re: [MTT users] Corrupted MTT database or incorrect query

2006-11-13 Thread Josh Hursey


On Nov 13, 2006, at 10:27 AM, Ethan Mallove wrote:


I can infer that you have an MPI Install section labeled
"odin 64 bit gcc". A few questions:

* What is the mpi_get for that section (or how does that
  parameter get filled in by your automated scripts)?


I attached the generated INI file for you to look at.


nightly-trunk-64-gcc.ini-gen
Description: Binary data


It is the same value for all parallel runs of GCC+64bit (same value  
for all branches)




* Do you start with a fresh scratch tree every run?


Yep. Every run, and all of the parallel runs.


* Could you email me your scratch/installs/mpi_installs.xml
  files?

The attached mpi_installs.xml is from the trunk+gcc+64bit parallel  
scratch directory.




I checked on how widespread this issue is, and found that
18,700 out of 474,000 Test Run rows in the past month have a
mpi_version/command mismatch (v1.2 vs. trunk), occurring in both
directions (version=1.2 with a trunk command and vice versa).
They occur on these clusters:

 Cisco MPI development cluster
 IU Odin
 IU - Thor - TESTING



Interesting...


There *is* that race condition in which one MTT submission
could overwrite another's index. Do you have "trunk" and
"1.2" runs submitting to the database at the same time?


Yes we do. :(

The parallel blocks, as we call them, are separate scratch
directories in which MTT runs concurrently: we have N parallel-block
scratch directories, each running one instance of MTT. So it is
possible (and highly likely) that when the reporter phase fires, all
N parallel blocks are firing it at about the same time.


Without knowing how the reporter is doing the inserts into the
database, I don't think I can help much more than that with
debugging. When the reporter fires for the DB:
 - Does it start a transaction for the connection, do the inserts,
then commit?
 - Does it ship the inserts to the server and let it run them, or
does the client do all of the individual inserts?
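
For reference, the transactional pattern I am asking about in the
first question would look roughly like this; the table and column
names are made up, since I don't know the actual schema:

    -- Hypothetical table, for illustration only.
    CREATE TABLE test_run (
        mpi_install_id  integer,
        command         text,
        result          text
    );

    -- All of the inserts for one submission grouped into a single
    -- transaction and committed as one unit.
    BEGIN;
    INSERT INTO test_run (mpi_install_id, command, result)
        VALUES (42, 'mpirun ...', 'pass');
    INSERT INTO test_run (mpi_install_id, command, result)
        VALUES (42, 'mpirun ...', 'fail');
    COMMIT;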


-- Josh




On Sun, Nov/12/2006 06:04:17PM, Jeff Squyres (jsquyres) wrote:


   I feel somewhat better now.  Ethan - can you fix?
-Original Message-
   From:   Tim Mattox [[1]mailto:timat...@open-mpi.org]
   Sent:   Sunday, November 12, 2006 05:34 PM Eastern Standard Time
   To: General user list for the MPI Testing Tool
   Subject:[MTT users] Corrupted MTT database or
incorrect query

   Hello,
   I just noticed that the MTT summary page is presenting
   incorrect information for our recent runs at IU.  It is
   showing failures for the 1.2b1 that actually came from
   the trunk!  See the first entry in this table:
   http://www.open-mpi.org/mtt/reporter.php?&maf_start_test_timestamp=2006-11-12%2019:12:02%20through%202006-11-12%2022:12:02&ft_platform_id=contains&tf_platform_id=IU&maf_phase=runs&maf_success=fail&by_atom=*by_test_case&go=Table&maf_agg_timestamp=-&mef_mpi_name=All&mef_mpi_version=All&mef_os_name=All&mef_os_version=All&mef_platform_hardware=All&mef_platform_id=All&agg_platform_id=off&1-page=off&no_bookmarks&no_bookmarks
   Click on the [i] in the upper right (the first entry)
   to get the popup window which shows the mpirun cmd as:
   mpirun -mca btl tcp,sm,self -np 6 --prefix
   /san/homedirs/mpiteam/mtt-runs/odin/20061112-Testing-NOCLN/parallel-block-3/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install
   dynamic/spawn
   Note the path has "1.3a1r12559" in the
   name... it's a run from the trunk, yet the table showed
   this as a 1.2b1 run.  There are several of these
   misattributed errors.  This would explain why Jeff saw
   some ddt errors on the 1.2 branch yesterday, but was
   unable to reproduce them.  They were from the trunk!
   --
   Tim Mattox - [2]http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... [3]http://www.the-brights.net/
   ___
   mtt-users mailing list
   mtt-us...@open-mpi.org
   [4]http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users

References

   1. mailto:timat...@open-mpi.org
   2. http://homepage.mac.com/tmattox/
   3. http://www.the-brights.net/
   4. http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users



___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users



--
-Ethan
___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users



Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/



Re: [MTT users] Corrupted MTT database or incorrect query

2006-11-13 Thread Ethan Mallove
I can infer that you have an MPI Install section labeled
"odin 64 bit gcc". A few questions:

* What is the mpi_get for that section (or how does that
  parameter get filled in by your automated scripts)?  
* Do you start with a fresh scratch tree every run?
* Could you email me your scratch/installs/mpi_installs.xml
  files?

I checked on how widespread this issue is, and found that
18,700 out of 474,000 Test Run rows in the past month have a
mpi_version/command mismatch (v1.2 vs. trunk), occurring in both
directions (version=1.2 with a trunk command and vice versa).
They occur on these clusters:

 Cisco MPI development cluster
 IU Odin
 IU - Thor - TESTING
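
Roughly the kind of query involved in that check; the table and
column names here are made up for illustration and are not the
actual MTT schema:

    -- Hypothetical table, for illustration only.
    CREATE TABLE test_run (
        mpi_version      text,
        command          text,
        start_timestamp  timestamp
    );

    -- Count rows from the last month whose recorded MPI version
    -- disagrees with the version implied by the mpirun command's
    -- install path.
    SELECT COUNT(*)
      FROM test_run
     WHERE start_timestamp > now() - interval '1 month'
       AND ((mpi_version LIKE '1.2%'   AND command LIKE '%trunk%')
         OR (mpi_version LIKE '1.3a1%' AND command LIKE '%v1.2%'));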

There *is* that race condition in which one MTT submission
could overwrite another's index. Do you have "trunk" and
"1.2" runs submitting to the database at the same time?


On Sun, Nov/12/2006 06:04:17PM, Jeff Squyres (jsquyres) wrote:
> 
>I feel somewhat better now.  Ethan - can you fix?
> -Original Message-
>From:   Tim Mattox [[1]mailto:timat...@open-mpi.org]
>Sent:   Sunday, November 12, 2006 05:34 PM Eastern Standard Time
>To: General user list for the MPI Testing Tool
>Subject:[MTT users] Corrupted MTT database or incorrect query
>Hello,
>I just noticed that the MTT summary page is presenting
>incorrect information for our recent runs at IU.  It is
>showing failures for the 1.2b1 that actually came from
>the trunk!  See the first entry in this table:
>http://www.open-mpi.org/mtt/reporter.php?&maf_start_test_timestamp=2006-11-12%2019:12:02%20through%202006-11-12%2022:12:02&ft_platform_id=contains&tf_platform_id=IU&maf_phase=runs&maf_success=fail&by_atom=*by_test_case&go=Table&maf_agg_timestamp=-&mef_mpi_name=All&mef_mpi_version=All&mef_os_name=All&mef_os_version=All&mef_platform_hardware=All&mef_platform_id=All&agg_platform_id=off&1-page=off&no_bookmarks&no_bookmarks
>Click on the [i] in the upper right (the first entry)
>to get the popup window which shows the mpirun cmd as:
>mpirun -mca btl tcp,sm,self -np 6 --prefix
>/san/homedirs/mpiteam/mtt-runs/odin/20061112-Testing-NOCLN/parallel-block-3/installs/ompi-nightly-trunk/odin_64_bit_gcc/1.3a1r12559/install
>dynamic/spawn
>Note the path has "1.3a1r12559" in the
>name... it's a run from the trunk, yet the table showed
>this as a 1.2b1 run.  There are several of these
>misattributed errors.  This would explain why Jeff saw
>some ddt errors on the 1.2 branch yesterday, but was
>unable to reproduce them.  They were from the trunk!
>--
>Tim Mattox - [2]http://homepage.mac.com/tmattox/
> tmat...@gmail.com || timat...@open-mpi.org
>I'm a bright... [3]http://www.the-brights.net/
>___
>mtt-users mailing list
>mtt-us...@open-mpi.org
>[4]http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
> 
> References
> 
>1. mailto:timat...@open-mpi.org
>2. http://homepage.mac.com/tmattox/
>3. http://www.the-brights.net/
>4. http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users

> ___
> mtt-users mailing list
> mtt-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users


-- 
-Ethan