Re: [Lustre-discuss] lustre-discuss@lists.lustre.org

2010-12-20 Thread Chris Gearing
Hi Aurélien,

Do you have a specification for the JUnit test results you produce, or
an example of one of your test result sets? We would be more than
willing to pick up and go with something that can be used with a wider
set of tools, with the obvious caveat that it provides everything needed
to completely capture the test results for Lustre, today and in the future.

If you have some example result sets that you can forward, please mail
them to chris whamcloud.com.

Thanks

Chris

I see that PerfPublisher uses XML, although that seems to be the only
specification available for its format.

On 17/12/2010 20:11, Aurélien wrote:
 Robert Read wrote:
 We don't plan to use Hudson to manage our testing results as I don't 
 think it would scale very well for all the testing we might do for 
 each build. We're currently building a more custom results server 
 that's similar (in spirit at least) to the kinds of tools we had at 
 Oracle.  We'll make it available once it's in presentable form.
 Actually, our first step was to replace the acceptance-small.sh 
 driver script with one that has a more sensible user interface for 
 running the standard tests.  Since the test-framework.sh on master 
 has already been changed to produce test results in yaml format, 
  the new script collects these with the logs, and is capable of 
 submitting them to the test results server.   Currently this is 
 being run manually, though.  Automating the test execution and
 connecting all the pieces will be the next step.
 Ok, I will be very interested in seeing the final result.
 But I think it is a good idea to stick to standard formats and tools
 as much as possible. It would be a pity if all your new work were
 usable only by your own tool.

 JUnit is quite standard.
 PerfPublisher has its own format due to JUnit limitations, and there are
 others as well. It would be really good if you did not create a new one.

 And indeed, acc-sm is a bit limited, and improving it could be really
 interesting.


 Aurélien



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] how to reuse OST indices (EADDRINUSE)

2010-12-20 Thread Craig Prescott

Hello list,

We recently evacuated several OSTs on a single OSS, replaced RAID 
controllers, re-initialized RAIDs for new OSTs, and made new lustre 
filesystems for them, using the same OST indices as we had before.

The filesystem and all its clients have been up and running the whole 
time.  We disabled the OSTs we were working on on all clients and our 
MGS/MDS (lctl dl shows them as IN everywhere).

Now we want to bring the newly-formatted OSTs back online.  When we try
to mount the new OSTs, we get the following in the syslog of the OSS that
has been under maintenance, for each one:

 Lustre: mgc10.13.28@o2ib: Reactivating import
 LustreError: 11-0: an error occurred while communicating with 
 10.13.28@o2ib. The mgs_target_reg operation failed with -98
 LustreError: 6065:0:(obd_mount.c:1097:server_start_targets()) Required 
 registration failed for cms-OST0006: -98
 LustreError: 6065:0:(obd_mount.c:1655:server_fill_super()) Unable to start 
 targets: -98
 LustreError: 6065:0:(obd_mount.c:1438:server_put_super()) no obd cms-OST0006
 LustreError: 6065:0:(obd_mount.c:147:server_deregister_mount()) cms-OST0006 
 not registered

What do we need to do to get these OSTs back into the filesystem?

We really want to reuse the original indices.

This is Lustre 1.8.4, btw.

Thanks,
Craig Prescott
UF HPC Center
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] how to reuse OST indices (EADDRINUSE)

2010-12-20 Thread Wang Yibin
Hello,

Did you back up the old magic files (last_rcvd, LAST_ID, CONFIGS/*) from the
original OSTs and put them back before trying to mount the new ones?
You probably didn't do that, so when you remount OSTs with an existing index,
the MGS will refuse to add them unless it is told to writeconf, hence the
-EADDRINUSE (-98).
The proper way to replace an OST is described in bug 24128.
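
For illustration only, a rough sketch of the writeconf route on Lustre 1.8
(the device paths and MGS NID are placeholders, and the filesystem name "cms"
is taken from your log; the authoritative steps are in bug 24128 and the
operations manual):

  # Stop the clients, then unmount the MDT and all OSTs.
  # Regenerate the configuration logs so the MGS forgets the stale OST entry:
  tunefs.lustre --writeconf /dev/<mdt_device>       # on the MDS
  tunefs.lustre --writeconf /dev/<ost_device>       # on every surviving OST
  # Format the replacement OST, reusing the old index (cms-OST0006 -> index 6):
  mkfs.lustre --ost --fsname=cms --index=6 \
      --mgsnode=<mgs_nid>@o2ib /dev/<new_ost_device>
  # Remount the MDT first, then the OSTs, then the clients.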

On 2010-12-21, at 8:33 AM, Craig Prescott wrote:

 
 Hello list,
 
 We recently evacuated several OSTs on a single OSS, replaced RAID 
 controllers, re-initialized RAIDs for new OSTs, and made new lustre 
 filesystems for them, using the same OST indices as we had before.
 
 The filesystem and all its clients have been up and running the whole 
 time.  We disabled the OSTs we were working on on all clients and our 
 MGS/MDS (lctl dl shows them as IN everywhere).
 
 Now we want to bring the newly-formatted OSTs back online.  When we try
 to mount the new OSTs, we get the following in the syslog of the OSS that
 has been under maintenance, for each one:
 
 Lustre: mgc10.13.28@o2ib: Reactivating import
 LustreError: 11-0: an error occurred while communicating with 
 10.13.28@o2ib. The mgs_target_reg operation failed with -98
 LustreError: 6065:0:(obd_mount.c:1097:server_start_targets()) Required 
 registration failed for cms-OST0006: -98
 LustreError: 6065:0:(obd_mount.c:1655:server_fill_super()) Unable to start 
 targets: -98
 LustreError: 6065:0:(obd_mount.c:1438:server_put_super()) no obd cms-OST0006
 LustreError: 6065:0:(obd_mount.c:147:server_deregister_mount()) cms-OST0006 
 not registered
 
 What do we need to do to get these OSTs back into the filesystem?
 
 We really want to reuse the original indices.
 
 This is Lustre 1.8.4, btw.
 
 Thanks,
 Craig Prescott
 UF HPC Center
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fwd: Reg /// OSS rebooted automatically

2010-12-20 Thread Jeff Johnson
Daniel,

It looks like your OST backend storage device may be having an issue. I
would check the health and stability of the backend storage device or RAID
you are using for the OST. A problem there wouldn't by itself be likely to
reboot your OSS, so there may be additional hardware and/or OS problems
causing the reboots, on top of Lustre complaining that it can't find the
OST storage device.

Others here on the list will likely give you a more detailed answer. The
storage device is the place I would look first.
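
As a generic first check (a sketch only, not specific to your setup; device
names are placeholders), you can confirm on the OSS whether the OST target is
actually configured and mounted. The -19 in your log is ENODEV, and the 137-5
message says "no target", i.e. the OST isn't running on that server:

  lctl dl                           # list configured Lustre devices; the OST should appear
  mount -t lustre                   # confirm the OST backend filesystem is mounted
  cat /proc/fs/lustre/health_check  # should report "healthy"
  dmesg | tail -50                  # look for disk/controller errors around the reboot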

--Jeff

-- 
--
Jeff Johnson
Manager
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101   f: 858-412-3845
m: 619-204-9061

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

On Mon, Dec 20, 2010 at 9:43 PM, Daniel Raj danielraj2...@gmail.com wrote:




 Hi Genius,


 Good Day  !!


 I am Daniel. My OSS keeps getting rebooted automatically, again and again.
 Kindly help me.

 It is showing the error messages below:


  kernel: LustreError: 23351:0:(ldlm_lib.c:1892:target_send_reply_msg())
 @@@ processing error (-19)  r...@810400e24400 x1353488904620274/t0
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292738958 ref 1 fl Interpret:/0/0 rc
 -19/0
 kernel: LustreError: 137-5: UUID 'south-ost7_UUID' is not available  for
 connect (no target)
 kernel: LustreError: 23284:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@
 processing error (-19)  r...@8101124c7c00 x1353488904620359/t0
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292739025 ref 1 fl Interpret:/0/0 rc
 -19/0

 Regards,

 Daniel A


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fwd: Reg /// OSS rebooted automatically

2010-12-20 Thread Daniel Raj
Hi Jeff,


Thanks for your reply

Storage information:


DL380 G5    == OSS + 16 GB RAM
OS          == SFS G3.2-2 + CentOS 5.3 + Lustre 1.8.3
MSA60 box   == OST
RAID 6


Regards,

Daniel A

On Tue, Dec 21, 2010 at 11:45 AM, Jeff Johnson 
jeff.john...@aeoncomputing.com wrote:

 Daniel,

 It looks like your OST backend storage device may be having an issue. I
 would check the health and stability of the backend storage device or RAID
 you are using for the OST. A problem there wouldn't by itself be likely to
 reboot your OSS, so there may be additional hardware and/or OS problems
 causing the reboots, on top of Lustre complaining that it can't find the
 OST storage device.

 Others here on the list will likely give you a more detailed answer. The
 storage device is the place I would look first.

 --Jeff

 --
 --
 Jeff Johnson
 Manager
 Aeon Computing

 jeff.john...@aeoncomputing.com
 www.aeoncomputing.com
 t: 858-412-3810 x101   f: 858-412-3845
 m: 619-204-9061

 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117


 On Mon, Dec 20, 2010 at 9:43 PM, Daniel Raj danielraj2...@gmail.com wrote:




 Hi Genius,


 Good Day  !!


 I am Daniel. My OSS keeps getting rebooted automatically, again and again.
 Kindly help me.

 It is showing the error messages below:


  kernel: LustreError: 23351:0:(ldlm_lib.c:1892:target_send_reply_msg())
 @@@ processing error (-19)  r...@810400e24400 x1353488904620274/t0
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292738958 ref 1 fl Interpret:/0/0 rc
 -19/0
 kernel: LustreError: 137-5: UUID 'south-ost7_UUID' is not available  for
 connect (no target)
 kernel: LustreError: 23284:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@
 processing error (-19)  r...@8101124c7c00 x1353488904620359/t0
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292739025 ref 1 fl Interpret:/0/0 rc
 -19/0

 Regards,

 Daniel A



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fwd: Reg /// OSS rebooted automatically

2010-12-20 Thread Jeff Johnson
Daniel,

Check the health and stability of your RAID-6 volume. Make sure the RAID is
healthy and online. Use whatever monitoring utility came with your RAID card,
or check /proc/mdstat if it's Linux mdraid. Check /var/log/messages for error
messages from your RAID or other hardware.
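
For example (a generic sketch; the device names are placeholders, and an HP
Smart Array behind an MSA60 would normally be checked with the vendor's own
tool rather than mdadm):

  cat /proc/mdstat                               # md software-RAID status, if applicable
  mdadm --detail /dev/md0                        # detailed state of an md array (placeholder device)
  grep -iE 'raid|scsi|error' /var/log/messages   # hardware errors logged by syslog
  smartctl -H /dev/sda                           # per-disk health, if smartmontools is installed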

--Jeff

---mobile signature---
Jeff Johnson - Aeon Computing
jeff.john...@aeoncomputing.com

On Dec 20, 2010, at 22:27, Daniel Raj danielraj2...@gmail.com wrote:

 Hi Jeff,
 
 
 Thanks for your reply 
 
 Storage information:
 
 
 DL380 G5    == OSS + 16 GB RAM
 OS          == SFS G3.2-2 + CentOS 5.3 + Lustre 1.8.3
 MSA60 box   == OST
 RAID 6
 
 
 Regards,
 
 Daniel A 
 
 On Tue, Dec 21, 2010 at 11:45 AM, Jeff Johnson 
 jeff.john...@aeoncomputing.com wrote:
 Daniel,
 
 It looks like your OST backend storage device may be having an issue. I would
 check the health and stability of the backend storage device or RAID you are
 using for the OST. A problem there wouldn't by itself be likely to reboot your
 OSS, so there may be additional hardware and/or OS problems causing the
 reboots, on top of Lustre complaining that it can't find the OST storage
 device.
 
 Others here on the list will likely give you a more detailed answer. The
 storage device is the place I would look first.
 
 --Jeff
 
 -- 
 --
 Jeff Johnson
 Manager
 Aeon Computing
 
 jeff.john...@aeoncomputing.com
 www.aeoncomputing.com
 t: 858-412-3810 x101   f: 858-412-3845
 m: 619-204-9061
 
 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117
 
 
 On Mon, Dec 20, 2010 at 9:43 PM, Daniel Raj danielraj2...@gmail.com wrote:
 
 
 
 Hi Genius,
 
 
 Good Day  !!
 
 
 I am Daniel. My OSS keeps getting rebooted automatically, again and again.
 Kindly help me.
 
 It is showing the error messages below:
 
 
  kernel: LustreError: 23351:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ 
 processing error (-19)  r...@810400e24400 x1353488904620274/t0 
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292738958 ref 1 fl Interpret:/0/0 rc 
 -19/0
 kernel: LustreError: 137-5: UUID 'south-ost7_UUID' is not available  for 
 connect (no target)
 kernel: LustreError: 23284:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ 
 processing error (-19)  r...@8101124c7c00 x1353488904620359/t0 
 o8-?@?:0/0 lens 368/0 e 0 to 0 dl 1292739025 ref 1 fl Interpret:/0/0 rc 
 -19/0
 
 
 Regards,
 
 Daniel A 
 
 
 
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss