Re: [Libmesh-devel] Job Hanging

2013-04-19 Thread Roy Stogner
On Fri, 19 Apr 2013, Derek Gaston wrote: Well - of course I spoke too soon.  It just crashed again... but this time I got a core dump. Check out the stack trace: #0  variant_filter_iterator::Pred<__gnu_cxx::__normal_iterator > >, libMesh::Predicates::Local<__gnu_cxx::__normal_iterator > > >

Re: [Libmesh-devel] Job Hanging

2013-04-19 Thread Derek Gaston
Well - of course I spoke too soon. It just crashed again... but this time I got a core dump. Check out the stack trace: #0 variant_filter_iterator::Pred<__gnu_cxx::__normal_iterator > >, libMesh::Predicates::Local<__gnu_cxx::__normal_iterator > > > >::operator() (this=0x36d9f950, in=) at ./in

Re: [Libmesh-devel] Job Hanging

2013-04-19 Thread Roy Stogner
On Fri, 19 Apr 2013, Derek Gaston wrote: Just to put an end-cap on this I switched over to the newest mvapich (1.9b) and all of this stuff cleared up.  It's still not clear to me what the issue is/was but it's working ;-) That's a relief! Thanks for keeping us updated! It'd be nice

Re: [Libmesh-devel] Job Hanging

2013-04-19 Thread Derek Gaston
Just to put an end-cap on this I switched over to the newest mvapich (1.9b) and all of this stuff cleared up. It's still not clear to me what the issue is/was but it's working ;-) Derek On Wed, Apr 10, 2013 at 12:55 PM, Kirk, Benjamin (JSC-EG311) < benjamin.kir...@nasa.gov> wrote: > >

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Kirk, Benjamin (JSC-EG311)
On Apr 10, 2013, at 1:44 PM, "Kirk, Benjamin (JSC-EG311)" wrote: > On Apr 10, 2013, at 1:42 PM, Derek Gaston wrote: > >> >> OOoh! Really? That sounds perfect! I'll try it! > > Please try my barrier() first though, referring to the email that just > crossed this one… And a no-op won

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Kirk, Benjamin (JSC-EG311)
On Apr 10, 2013, at 1:42 PM, Derek Gaston wrote: > > OOoh! Really? That sounds perfect! I'll try it! Please try my barrier() first though, referring to the email that just crossed this one… -Ben

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Kirk, Benjamin (JSC-EG311)
On Apr 10, 2013, at 1:18 PM, "Kirk, Benjamin (JSC-EG311)" wrote: >> >> Anyone sleep on this and come up with any ideas to try? > > I'm reviewing the code now… Is there a restart file with this case, or is it > a fresh start? I'm curious if we have a good old-fashioned race condition here. W

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Derek Gaston
On Wed, Apr 10, 2013 at 12:18 PM, Kirk, Benjamin (JSC-EG311) < benjamin.kir...@nasa.gov> wrote: > I'm reviewing the code now… Is there a restart file with this case, or is > it a fresh start? > Fresh Start > If the latter, you may be able to turn > libMesh::MeshCommunication::find_global_indi

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Kirk, Benjamin (JSC-EG311)
On Apr 10, 2013, at 12:57 PM, Derek Gaston wrote: > Anyone sleep on this and come up with any ideas to try? I'm reviewing the code now… Is there a restart file with this case, or is it a fresh start? If the latter, you may be able to turn libMesh::MeshCommunication::find_global_indices() in

Re: [Libmesh-devel] Job Hanging

2013-04-10 Thread Derek Gaston
Anyone sleep on this and come up with any ideas to try? One thing to note is that we are actually reading the mesh on every processor still (because of the block / sideset naming stuff that Cody only recently fixed). Do you believe that could be part of the problem? Currently I can't run over ab

Re: [Libmesh-devel] Job Hanging

2013-04-09 Thread Derek Gaston
Another data point... job starts fine on half the procs Derek On Tue, Apr 9, 2013 at 8:26 PM, Derek Gaston wrote: > Is there any way to disable the hilbert stuff for now? With serial mesh > can we just take the numbering from the node numbering? > > > On Tue, Apr 9, 2013 at 8:21 PM, Derek

Re: [Libmesh-devel] Job Hanging

2013-04-09 Thread Derek Gaston
Is there any way to disable the hilbert stuff for now? With serial mesh can we just take the numbering from the node numbering? On Tue, Apr 9, 2013 at 8:21 PM, Derek Gaston wrote: > serial > > > On Tue, Apr 9, 2013 at 8:21 PM, Kirk, Benjamin (JSC-EG311) < > benjamin.kir...@nasa.gov> wrote: > >
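
For illustration only, here is a minimal sketch of the idea in the question above. It assumes a serial (replicated) mesh, where every processor already holds every node id, so a contiguous global numbering can be computed locally with no communication. The helper name `contiguous_indices_from_ids` is hypothetical and is not part of the libMesh API; it only ranks plain integer ids and does not touch any libMesh data structure.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper, NOT a libMesh function: maps each node id to a
// contiguous index (0, 1, 2, ...) by ranking the ids in ascending order.
// Assumption: every processor sees the same full set of ids (serial mesh),
// so every processor computes the same map without exchanging messages.
std::unordered_map<std::size_t, std::size_t>
contiguous_indices_from_ids(std::vector<std::size_t> node_ids)
{
  // Sort the ids and drop duplicates so each id gets exactly one index.
  std::sort(node_ids.begin(), node_ids.end());
  node_ids.erase(std::unique(node_ids.begin(), node_ids.end()),
                 node_ids.end());

  // The position of an id in the sorted list is its contiguous index.
  std::unordered_map<std::size_t, std::size_t> index_of;
  index_of.reserve(node_ids.size());
  for (std::size_t i = 0; i < node_ids.size(); ++i)
    index_of[node_ids[i]] = i;

  return index_of;
}
```

Because the result depends only on the set of ids, it is deterministic across processors. It gives up the spatial locality that a Hilbert-curve ordering provides, which is the trade-off implied by the question.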

Re: [Libmesh-devel] Job Hanging

2013-04-09 Thread Derek Gaston
serial On Tue, Apr 9, 2013 at 8:21 PM, Kirk, Benjamin (JSC-EG311) < benjamin.kir...@nasa.gov> wrote: > Serial or parallel mesh? > > > > On Apr 9, 2013, at 9:16 PM, "Kirk, Benjamin (JSC-EG311)" < > benjamin.kir...@nasa.gov> wrote: > > > Hmm - I'll look through that section of code tomorrow mornin

Re: [Libmesh-devel] Job Hanging

2013-04-09 Thread Kirk, Benjamin (JSC-EG311)
Serial or parallel mesh? On Apr 9, 2013, at 9:16 PM, "Kirk, Benjamin (JSC-EG311)" wrote: > Hmm - I'll look through that section of code tomorrow morning and see if > there could possibly be any mismatched send/receives or anything. > > -Ben > > On Apr 9, 2013, at 8:48 PM, "Derek Gaston" w

Re: [Libmesh-devel] Job Hanging

2013-04-09 Thread Kirk, Benjamin (JSC-EG311)
Hmm - I'll look through that section of code tomorrow morning and see if there could possibly be any mismatched send/receives or anything. -Ben On Apr 9, 2013, at 8:48 PM, "Derek Gaston" wrote: > Hey guys, > > I've got a fairly large job (>3500 procs) that is hanging while trying to > set up