Re: [Lustre-discuss] Not sure how we should configure our RAID arrays (HW limitation)

2012-05-07 Thread Kevin Van Maren
The 512K stripe size should be fine for Lustre, and 128KB per disk is enough to 
get good performance from the underlying hard drive.

I don't know anything about the E18s beyond what you've posted, so I can't 
guess which configuration is more optimal. I would suggest you create the 
RAID arrays, format the LUNs for Lustre, and run the Lustre iokit to see how 
the various configurations perform (3 * 4+2, 2 * 8+1, 2 * 7+2).  Then please 
post the results (with mkfs and other command lines) here so others can benefit 
from your experiments and/or suggest additional tunings.
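
As a rough illustration (not a tested recipe: the exact variables are documented 
in the lustre-iokit README, and the values here are just placeholders), an 
obdfilter-survey run against a freshly formatted OST looks something like:

  # run on the OSS after the OST is formatted and mounted
  nobjhi=2 thrhi=16 size=1024 case=disk sh obdfilter-survey

Running the same survey against each candidate RAID layout gives directly 
comparable numbers.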

Kevin


On May 4, 2012, at 3:14 PM, Frank Riley wrote:

 How about doing 3 4+2 RAIDs?  12 usable disks, instead of 14 or 16, but still
 better than 8 with RAID1.  Doing 4*128KB, resulting in 2 full-stripe writes 
 for
 each 1MB IO is not that bad.
 
 Yes, of course. I had thought of this option earlier but forgot to include 
 it. Thanks for reminding me. So using a stripe width of 512K will not harm 
 performance that much? Note also that the E18s have two active/active 
 controllers in them so that means one controller will be handling I/O 
 requests for 2 arrays, which will reduce performance somewhat. Would this 
 affect your decision between 3 4+2 (512K) or 2 7+2 (896K)?




Re: [Lustre-discuss] recovery from multiple disks failure on the same md

2012-05-07 Thread Kevin Van Maren

On May 6, 2012, at 10:13 PM, Tae Young Hong wrote:

Hi,

I found a terrible situation on our Lustre system.
An OST (RAID 6: 8+2, 1 spare) had 2 disk failures at almost the same time. While 
recovering it, another disk failed, so the recovery procedure seems to have halted, 
and the spare disk that was resyncing fell back into spare status. (I guess the 
resync was more than 95% finished.)
Right now we have just 7 active disks for this md. Is there any possibility of 
recovering from this situation?



It might be possible, but it is not something I've done.  If the array has not been 
written to since a drive failed, you might be able to power-cycle the failed 
drives (to reset the firmware) and force re-add them (without a rebuild).  If 
the array _has_ been modified (most likely), you could write a sector of 0's over 
the bad sector, which will corrupt just that stripe, then force re-add the last 
failed drive and attempt the rebuild again.

Certainly if you have a support contract I'd recommend you get professional 
assistance.
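
For reference only, a rough sketch of the kind of commands involved (device and 
array names are taken from the log below, but treat this as an untested outline, 
not a procedure; check mdadm(8) and the event counts on each member first):

  mdadm --stop /dev/md12
  # reassemble from the members that are still consistent, forcing the most
  # recently failed drive back in despite its event-count mismatch
  mdadm --assemble --force /dev/md12 /dev/sd[lmnopqrstu] /dev/sdw
  # or, on a running degraded array, try to re-add a previously failed member
  mdadm /dev/md12 --re-add /dev/sdp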



Unfortunately, the failure mode you encountered is all too common.  Because the 
Linux SW RAID code does not read the parity blocks unless there is a problem, 
hard drive failures are NOT independent: drives appear to fail more often 
during a rebuild than at any other time.  The only way to work around this 
problem is to periodically do a verify of the MD array.

A verify allows the drive, which is failing in the 20% of the space that 
contains parity, to fail _before_ the data becomes unreadable, rather than fail 
_after_ the data becomes unreadable.  Don't do it on a degraded array, but it 
is a good way to ensure healthy arrays are really healthy.

Run "echo check > /sys/block/mdX/md/sync_action" to force a verify.  Parity 
mis-matches will be reported (not corrected), but drive failures can be dealt 
with sooner, rather than letting them stack up.  See "man md" and the 
sync_action section.
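
For example, a minimal sketch against the array from the log below (md12; the 
exact array name and any scheduling are up to you):

  # kick off a background verify
  echo check > /sys/block/md12/md/sync_action
  # watch progress and the parity mismatch count
  cat /proc/mdstat
  cat /sys/block/md12/md/mismatch_cnt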

Also note that Lustre 1.8.7 has a fix to the SW RAID code (corruption when 
rebuilding under load).  Oracle's release called the patch 
md-avoid-corrupted-ldiskfs-after-rebuild.patch, while Whamcloud called it 
raid5-rebuild-corrupt-bug.patch

Kevin



The following is detailed log.
#1 the original configuration before any failure

    Number   Major   Minor   RaidDevice State
       0       8      176        0      active sync   /dev/sdl
       1       8      192        1      active sync   /dev/sdm
       2       8      208        2      active sync   /dev/sdn
       3       8      224        3      active sync   /dev/sdo
       4       8      240        4      active sync   /dev/sdp
       5      65        0        5      active sync   /dev/sdq
       6      65       16        6      active sync   /dev/sdr
       7      65       32        7      active sync   /dev/sds
       8      65       48        8      active sync   /dev/sdt
       9      65       96        9      active sync   /dev/sdw

      10      65       64        -      spare   /dev/sdu

#2 a disk(sdl) failed, and resync started after adding spare disk(sdu)
May  7 04:53:33 oss07 kernel: sd 1:0:10:0: SCSI error: return code = 0x0802
May  7 04:53:33 oss07 kernel: sdl: Current: sense key: Medium Error
May  7 04:53:33 oss07 kernel: Add. Sense: Unrecovered read error
May  7 04:53:33 oss07 kernel:
May  7 04:53:33 oss07 kernel: Info fld=0x74241ace
May  7 04:53:33 oss07 kernel: end_request: I/O error, dev sdl, sector 1948523214
... ...
May  7 04:54:15 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: md: syncing RAID array md12


#3 another disk(sdp) failed
May  7 04:54:42 oss07 kernel: end_request: I/O error, dev sdp, sector 1949298688
May  7 04:54:42 oss07 kernel: mptbase: ioc1: LogInfo(0x3108): 
Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x)
May  7 04:54:42 oss07 

Re: [Lustre-discuss] Not sure how we should configure our RAID arrays (HW limitation)

2012-05-04 Thread Kevin Van Maren

On May 4, 2012, at 2:53 PM, Frank Riley wrote:

 Hello,
 
 We are using Nexsan E18s for our storage systems, and we are in the process 
 of setting them up for Lustre. Each E18 has 18 disks total (max'ed out) in 
 them. According to the Lustre docs, I want to have a stripe width of 1MB. 
 Unfortunately, these E18s have a max stripe size of 128K. As I see it, for 
 RAID6 this leaves us two options:
 
 1)  One array 16+2 with a stripe size of 64K for a stripe width of 1MB. I'm 
 hesitant with this option because of the increased chance that we could have 
 more than 2 disks fail.
 
 2) Do two arrays 7+2 with a stripe size of 128K for a stripe width of 896K. 
 I'd then modify the max_pages_per_rpc tunable to match the 896K. I'm not sure 
 what to do with the flex_bg filesystem option since it has to be a power of 2.

Note that you need to set the Lustre stripe size to match 896K, as otherwise you 
will send 896KB plus 128KB to each OST (a full-stripe write plus a partial-stripe 
write for each 1MB chunk).  Additional tuning of the mkfs options is also 
necessary so that the file system understands the layout (see -E in the Lustre 
manual), as otherwise all the block allocations will start mid-stripe.  This is 
not ideal for applications that expect power-of-two sizes to be optimal.
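
As a hedged illustration of what that tuning could look like for a 7+2 array with 
a 128KB chunk (the numbers follow from this thread, but verify the exact option 
names against the mkfs.lustre and mke2fs man pages; <mgs_nid> and <ost_lun> are 
placeholders):

  # 128KB chunk = 32 x 4KB blocks; 7 data disks => stripe width = 224 blocks
  mkfs.lustre --fsname=lustre --ost --mgsnode=<mgs_nid> \
      --mkfsoptions="-E stride=32,stripe_width=224" /dev/<ost_lun>
  # match the Lustre stripe size and RPC size to the 896KB full stripe
  lfs setstripe -s 917504 /mnt/lustre          # 896KB = 917504 bytes
  lctl set_param osc.*.max_pages_per_rpc=224   # 224 x 4KB pages = 896KB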

 
 What is the better option here? Or is there an option I'm missing? I've 
 pretty much ruled out RAID5 arrays at 8+1 due to data loss risk, and RAID1+0 
 wastes too much disk for our use.

8+1 is the best option from a Lustre performance standpoint.  You should get 
better performance from 2 7+2 arrays than with a 16+2 simply because you can 
have twice the number of independent IOs.

How about doing 3 4+2 RAIDs?  12 usable disks, instead of 14 or 16, but still 
better than 8 with RAID1.  Doing 4*128KB, resulting in 2 full-stripe writes for 
each 1MB IO is not that bad.

Kevin




Re: [Lustre-discuss] erroneous ENOSPC -28

2012-04-23 Thread Kevin Van Maren
There is information in that bug; have you looked at the tot_granted on the OSTs
and compared it to the sum of the cur_grant_bytes on the clients?
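
If it helps, one (hedged) way to pull those numbers on 1.8 is something like:

  # on each OSS: the total grant the OST believes it has handed out
  lctl get_param obdfilter.*.tot_granted
  # on each client: the grant held by each OSC, to be summed across all clients
  lctl get_param osc.*.cur_grant_bytes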

The workaround, until you upgrade, was to restart the OSTs to reset the grant.

Kevin



On Apr 23, 2012, at 9:18 AM, Gretchen Zwart wrote:

 Hi,
 Debian 5.0
 2.6.26-2-amd64 SMP (SLES 11)
 Lustre 1.8.1.1-1
 
 Lustre clients are getting ENOSPC -28 error messages but 'lfs df'
 results indicate that OSTs are no more than 50% full. This looks like
 it could be related to Bug 22755.
 What is the best way to nail down if this is the cause? I'm in the
 upgrade process, but I'd also like to know what is the fastest/best
 method to restore lustre functionality should I encounter this again.
 Regards,
 -- 
 Gretchen Zwart
 UMass Astronomy Dept. 619E Lederle
 710 North Pleasant ST
 Amherst,MA 01003
 (413) 577-2108




Re: [Lustre-discuss] Lustre on Debian

2012-04-01 Thread Kevin Van Maren
Save yourself pain: use a supported RedHat kernel on the server.


On Apr 1, 2012, at 8:08 AM, Mario Benitez wrote:

Hi guys,

I'm trying to set up Lustre on Debian (server & clients). Any hints out there?

Thanks in advance.

Marinho.-


Re: [Lustre-discuss] OSS1 Node issue

2012-02-21 Thread Kevin Van Maren
This is not the correct list for help with SGE.

That being said, the real issue (as has been mentioned by several people) is 
that an OST has gone read-only due to some issue.  The file system will not 
function properly until this is resolved, irrespective of where you put SGE.

You will need to check the logs on oss1 to find the initial issue, stop the bad 
OST, and take corrective action (the details of which depend on the issue).

Kevin

Sent from my iPhone

On Feb 21, 2012, at 3:23 AM, VIJESH EK <ekvij...@gmail.com> wrote:

-


We are waiting for your feedback.

Thanks & Regards

VIJESH E K



On Tue, Feb 21, 2012 at 12:22 PM, VIJESH EK <ekvij...@gmail.com> wrote:
Dear All,

We have made the following changes on the exec nodes, but we are still
getting the same errors in /var/log/messages.

1. We changed the exec nodes' spool directory to a local directory by editing 
the file /home/appl/sge-root/default/common/configuration and changing the 
execd_spool_dir parameter.

Even after this change, the same error (shown below) keeps appearing on the 
OSS1 node. This error is generated only on the OSS1 node.

Feb  6 18:32:10 oss1 kernel: LustreError: 
9362:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  6 18:32:05 oss1 kernel: LustreError: 
9422:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  6 18:32:06 oss1 kernel: LustreError: 
9432:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  6 18:32:07 oss1 kernel: LustreError: 
9369:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  6 18:32:10 oss1 kernel: LustreError: 
9362:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30


Can you tell me how to change the master spool directory?
Is it possible to change the directory in live mode?

Kindly explain briefly, so that we can proceed to the next step.


Thanks and Regards

VIJESH







On Fri, Feb 10, 2012 at 1:19 PM, Carlos Thomaz <ctho...@ddn.com> wrote:
Hi vijesh.

Are you running the SGE master spooling on Lustre?!?! What about the exec nodes' 
spooling?!

I strongly recommend that you do not run the master spooling on Lustre. And if 
possible, use local spooling on local disk for the exec nodes.

SGE (at least until version 6.2u7) is known to get unstable when running the 
spooling on Lustre.

Carlos

On Feb 10, 2012, at 1:18 AM, VIJESH EK <ekvij...@gmail.com> wrote:

Dear All,

Kindly provide a solution for the below issue...

Thanks & Regards

VIJESH E K



On Thu, Feb 9, 2012 at 3:26 PM, VIJESH EK <ekvij...@gmail.com> wrote:
Dear Sir,

I am getting the below error messages continuously on the OSS1 node, and they 
cause the sge service to stop running intermittently...


Feb  5 04:03:37 oss1 kernel: LustreError: 
9193:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:47 oss1 kernel: LustreError: 
9164:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:47 oss1 kernel: LustreError: 
28420:0:(filter_io_26.c:693:filter_commitrw_write()) error starting 
transaction: rc = -30
Feb  5 04:03:48 oss1 kernel: LustreError: 
9266:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:50 oss1 kernel: LustreError: 
9200:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:53 oss1 kernel: LustreError: 
9230:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:57 oss1 kernel: LustreError: 
9212:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:03 oss1 kernel: LustreError: 
9262:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:08 oss1 kernel: LustreError: 
9162:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:15 oss1 kernel: LustreError: 
9271:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:23 oss1 kernel: LustreError: 
9191:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:32 oss1 kernel: LustreError: 
9242:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30


I have attached the detailed log information herewith. The attached file 
contains the continuous /var/log/messages logs, separated by *.

So kindly give me a solution for this issue...

Thanks & Regards

VIJESH E K





-

Re: [Lustre-discuss] OSS1 Node issue

2012-02-21 Thread Kevin Van Maren
The logs you attached start sometime after the issue: to tell what happened you 
need to find the error in the logs before you started getting these errors:
  Feb  5 04:03:13 oss1 kernel: LustreError: 
9222:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30

It looks like you rebooted the server, and OST0 and OST1 were mounted, and you 
are NOT getting those errors any more, but both OSTs reported errors on mount.

So unmount the OSTs, and run:
  e2fsck /dev/dm-0
  e2fsck /dev/dm-1

I don't know how mangled your OSTs are, so I don't know what e2fsck will 
report.  See also http://wiki.lustre.org/index.php/Handling_File_System_Errors
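
If you want to see what e2fsck would find before letting it change anything, a 
read-only pass first is a reasonable approach (a hedged sketch, not official 
procedure):

  # read-only check: -f forces a full check, -n answers "no" to every fix
  e2fsck -fn /dev/dm-0
  e2fsck -fn /dev/dm-1
  # only after reviewing that output, run the actual repair
  e2fsck -fy /dev/dm-0
  e2fsck -fy /dev/dm-1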

Kevin



On Feb 21, 2012, at 10:43 PM, VIJESH EK wrote:

Dear Kevin,

Herewith I have attached the /var/log/messages; kindly go through the logs and
give me a solution for this immediately.
Can you tell me how to run e2fsck for an OST?
Please tell me the exact command and switches to run e2fsck
without affecting the data.

We are waiting for your reply.

Thanks & Regards

VIJESH E K



Re: [Lustre-discuss] LNET Performance Issue

2012-02-15 Thread Kevin Van Maren
Perhaps someone else here has a thought, but it does not make sense to me that 
loading SDP (which accelerates TCP traffic by bypassing the TCP stack) makes 
LNET faster if you are using ip@o2ib, and _not_ ip@tcp0, for your NIDs.

Any chance you've configured both TCP and O2IB nids on the machine, and it is 
somehow picking the TCP nids to use?

Can you confirm the lctl list_nids output, and your lustre/lnet sections of 
your modprobe.conf?
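
For comparison, an o2ib-only setup normally defines just a single LNET network, 
something like the following (a hedged example; the interface name is an 
assumption):

  # /etc/modprobe.conf (or /etc/modprobe.d/lustre.conf)
  options lnet networks="o2ib0(ib0)"

With that in place, lctl list_nids on each node should show only @o2ib0 entries.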

Kevin


On Feb 15, 2012, at 12:30 PM, Barberi, Carl E wrote:

We are having issues with LNET performance over Infiniband.  We have a 
configuration with a single MDT and six (6) OSTs.  The Lustre client I am using 
to test is configured to use 6 stripes (lfs setstripe -c  6 /mnt/lustre).  When 
I perform a test using the following command:

dd if=/dev/zero of=/mnt/lustre/test.dat bs=1M count=2000

I typically get a write rate of about 815 MB/s, and we never exceed 848 MB/s.  
When I run obdfilter-survey, we easily get about 3-4GB/s write speed, but when 
I run a series of lnet-selftests, the read and write rates range from 850MB/s – 
875MB/s max.  I have performed the following optimizations to increase the data 
rate:

On the Client:
lctl set_param osc.*.checksums=0
lctl set_param osc.*.max_dirty_mb=256

On the OSTs
lctl set_param obdfilter.*.writethrough_cache_enable=0
lctl set_param obdfilter.*.read_cache_enable=0

echo 4096 > /sys/block/<device>/queue/nr_requests

I have also loaded the ib_sdp module, which also brought an increase in speed.  
However, we need to be able to record at no less than 1GB/s, which we cannot 
achieve right now.  Any thoughts on how I can optimize LNET, which clearly 
seems to be the bottleneck?

Thank you for any help you can provide,
Carl Barberi


Re: [Lustre-discuss] obdidx ordering in lfs getstripe

2012-02-14 Thread Kevin Van Maren
On Feb 14, 2012, at 12:13 AM, Jack David wrote:

 On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger adil...@whamcloud.com wrote:
 On 2012-02-09, at 6:20 AM, Jack David wrote:
 In the output of lsf getstripe filename | dirname, the obdidx
 denotes the OST index (I assume).
 
 Consider the following output:
 
 lmm_stripe_count:   2
 lmm_stripe_size:1048576
 lmm_stripe_offset:  1
   obdidx   objid  objidgroup
1   20x20
0   30x30
 
 where I have a setup consisting of two OSTs. If I have more than two
 OSTs, is it possible that I get the obdidx values out of order? Or the
 obdidx values will always be linear?
 
 For example, in above output, the values are linear (like 1, 0 - and
 this pattern will be repeated while storing the data I assume). If I
 have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
 2,1,3,0 (or any pattern for that matter)??
 
 Typically the ordering will be linear, but this depends on a number of
 different factors:
 - what order the OSTs were created in:  without --index=N the OST order
  depends on the order in which they were first mounted, so using --index
  is always recommended, and will be mandatory in the future
 - the distribution of OSTs among OSS nodes:  the MDS object allocator
  will normally select one OST from each OSS before allocating another
  object from a different OST on the same OSS
 
 Thanks for this information.
 
 - the space available on each OST:  when OST free space is imbalanced
  the OSTs will be selected in part based on how full they are
 
 I have a doubt here. Lets say I have 4 OSTs, but the lustre client is
 issuing a write request which can be accommodated by any
 single OST (e.g. write request is of size 512bytes and stripe_size is
 1MB). In this case, how will the data be stored? Will the MDS maintain
 the index of next OST which should serve the request?


I think you are still confused about how it works.  The OSTs are selected
_when the file is created_.  The striping is a static map of offset to OST.
For example, if the stripe count = 2, and the stripe size = 1MB, then
0-1MB goes to the first OST, 1-2MB goes to the second, 2-3 goes to the first, 
etc.

The free space impacts _which_ OSTs are selected when a file is created,
it does NOT impact where data is written once a file is created.  So if an OST
fills up, every file that resides on that OST will be unable to grow if the 
growth is
to an offset that maps to that OST.

Kevin




Re: [Lustre-discuss] obdidx ordering in lfs getstripe

2012-02-14 Thread Kevin Van Maren
On Feb 14, 2012, at 6:51 AM, Jack David wrote:

 On Tue, Feb 14, 2012 at 6:57 PM, Kevin Van Maren kvanma...@fusionio.com 
 wrote:
 On Feb 14, 2012, at 12:13 AM, Jack David wrote:
 
 On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger adil...@whamcloud.com 
 wrote:
 On 2012-02-09, at 6:20 AM, Jack David wrote:
 In the output of lsf getstripe filename | dirname, the obdidx
 denotes the OST index (I assume).
 
 Consider the following output:
 
 lmm_stripe_count:   2
 lmm_stripe_size:1048576
 lmm_stripe_offset:  1
   obdidx   objid  objidgroup
1   20x20
0   30x30
 
 where I have a setup consisting of two OSTs. If I have more than two
 OSTs, is it possible that I get the obdidx values out of order? Or the
 obdidx values will always be linear?
 
 For example, in above output, the values are linear (like 1, 0 - and
 this pattern will be repeated while storing the data I assume). If I
 have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
 2,1,3,0 (or any pattern for that matter)??
 
 Typically the ordering will be linear, but this depends on a number of
 different factors:
 - what order the OSTs were created in:  without --index=N the OST order
  depends on the order in which they were first mounted, so using --index
  is always recommended, and will be mandatory in the future
 - the distribution of OSTs among OSS nodes:  the MDS object allocator
  will normally select one OST from each OSS before allocating another
  object from a different OST on the same OSS
 
 Thanks for this information.
 
 - the space available on each OST:  when OST free space is imbalanced
  the OSTs will be selected in part based on how full they are
 
 I have a doubt here. Lets say I have 4 OSTs, but the lustre client is
 issuing a write request which can be accommodated by any
 single OST (e.g. write request is of size 512bytes and stripe_size is
 1MB). In this case, how will the data be stored? Will the MDS maintain
 the index of next OST which should serve the request?
 
 
 I think you are still confused about how it works.  The OSTs are selected
 _when the file is created_.  The striping is a static map of offset to OST.
 For example, if the stripe count = 2, and the stripe size = 1MB, then
 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3 goes to the 
 first, etc.
 
 I understand that, but just got curious that does lustre client keeps
 track of which is the _next_ OST where the IO request should go to? I

No, it does not track the next, as that depends on the file offset.  For 
example,
with the 2-OST stripe example in my previous email, if the client writes 0-1MB,
2-3MB, and 4-5MB, all the data will be written to a single OST.


 am unaware of who decides the stripe_size at the time of file
 creation (by default it is 1MB - from the lfs setstripe man page), so I
 assume the client is not bothered about that. But if the client is
 generating write requests which are not a multiple of the stripe_size,
 multiple write requests can be combined and stored on one OST (e.g. if the
 stripe size is 1MB, then 20 reqs of 512 bytes can be stored on OST1, the next 20
 reqs on OST2, and so on).


1MB is the default default, but the actual default can vary system to system.

The file stripe is determined when the file is created.  lfs setstripe can
be used to create a file with a specified striping.

lfs setstripe can also be used to change the striping for a directory, which is
quite useful as that determines the default stripe for any files created in
that directory (including directories!)

When the client opens a file, the MDT returns the stripe information to the
client so that the client knows how to map file offsets to OST objects (and
the offset in that object).  It is the client's job (inside Lustre so it is 
automatic)
to figure out how to map a read/write to the server/ost/object/offset.
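
As a small worked illustration of that static mapping (hedged, with made-up 
numbers): with stripe_count=2 and stripe_size=1MB, the stripe index for a file 
offset is (offset / stripe_size) % stripe_count, and the chunks belonging to 
that stripe pack together within the OST object:

  # file offset 5MB, stripe_size 1MB, stripe_count 2
  offset=$((5 * 1048576))
  echo $(( (offset / 1048576) % 2 ))     # 1: this chunk maps to the second OST object
  echo $(( (offset / (2 * 1048576)) * 1048576 + offset % 1048576 ))
                                         # 2097152: the offset within that object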

Kevin


 Actually I am trying to understand how can I leverage the pNFS file
 layout semantics (which communicates to Data Servers directly once the
 layout is supplied by Meta Data Server) with Lustre Filesystem, and
 that is the source of such questions.
 
 The free space impacts _which_ OSTs are selected when a file is created,
 it does NOT impact where data is written once a file is created.  So if an OST
 fills up, every file that resides on that OST will be unable to grow if the 
 growth is
 to an offset that maps to that OST.
 
 
 Good to know that.
 
 Kevin
 
 

Re: [Lustre-discuss] Need cost for lustre

2012-02-14 Thread Kevin Van Maren
Lustre is free to use.  Support is optional, and that cost will vary depending 
on where you get it.

Kevin


On Feb 14, 2012, at 4:51 AM, Anantharamanan R wrote:

 Hello,
 
  I need to know the licensing cost for Lustre, Please provide me the 
 same
 
 
 Regards
 Ananth
 C-CAMP, NCBS
 INDIA  


Re: [Lustre-discuss] OSS1 Node issue

2012-02-10 Thread Kevin Van Maren
Errno 30 is EROFS, read-only file system.  Perhaps there is some issue further 
up in the logs indicating the OST went read-only?

Kevin


On Feb 10, 2012, at 12:17 AM, VIJESH EK wrote:

Dear All,

Kindly provide a solution for the below issue...

Thanks & Regards

VIJESH E K



On Thu, Feb 9, 2012 at 3:26 PM, VIJESH EK <ekvij...@gmail.com> wrote:
Dear Sir,

I am getting the below error messages continuously on the OSS1 node, and they 
cause the sge service to stop running intermittently...


Feb  5 04:03:37 oss1 kernel: LustreError: 
9193:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:47 oss1 kernel: LustreError: 
9164:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:47 oss1 kernel: LustreError: 
28420:0:(filter_io_26.c:693:filter_commitrw_write()) error starting 
transaction: rc = -30
Feb  5 04:03:48 oss1 kernel: LustreError: 
9266:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:50 oss1 kernel: LustreError: 
9200:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:53 oss1 kernel: LustreError: 
9230:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:03:57 oss1 kernel: LustreError: 
9212:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:03 oss1 kernel: LustreError: 
9262:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:08 oss1 kernel: LustreError: 
9162:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:15 oss1 kernel: LustreError: 
9271:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:23 oss1 kernel: LustreError: 
9191:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30
Feb  5 04:04:32 oss1 kernel: LustreError: 
9242:0:(filter_io_26.c:693:filter_commitrw_write()) error starting transaction: 
rc = -30


I have attached the detailed log information herewith. The attached file 
contains the continuous /var/log/messages logs, separated by *.

So kindly give me a solution for this issue...

Thanks & Regards

VIJESH E K







Re: [Lustre-discuss] Lustre 1.8.7 - Setup prototype in Research field - STUCK !

2012-02-07 Thread Kevin Van Maren
You can get the Oracle downloads from:
http://downloads.lustre.org/public/lustre/v1.8/lustre_1.8.7/

Basically, build Lustre for your kernel on the clients, but use the Lustre 
server kernel on the servers.
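
For the client side, a hedged sketch of building a patchless client against your 
existing kernel (double-check the option names against the build instructions 
for your exact Lustre version; the source path shown is the usual CentOS layout):

  # on a client node with its matching kernel-devel package installed
  cd lustre-1.8.7
  ./configure --with-linux=/usr/src/kernels/$(uname -r)-$(uname -m) --disable-server
  make rpms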

Kevin


On Feb 7, 2012, at 9:09 AM, Charles Cummings ccummi...@harthosp.org wrote:

 Hello Everyone,
  
 being the local crafty busy admin for a neuroscience research branch, Lustre 
 seems the only way to go however I'm a bit stuck and need some thoughtful 
 guidance.
  
 My goal  is to setup a virtual OS environment which is a replica of our
 Direct attached storage head node running SLES 11.0 x86 64   Kernel:  
 2.6.27.19-5 default #1 SMP
 and our (2) Dell blade clusters running CentOS 5.3 x86 64   Kernel: 
 2.6.18-128.el5 #1 SMP
 which I now have running as a) SLES 11 same kernel MDS  b) SLES 11 same 
 kernel OSS   and   c) CentOS 5.3 x86 64 same kernel
 and then get Lustre running across it.
  
 The trouble began when i was informed that the Lustre rpm kernel numbers MUST 
 match the OS kernel number EXACTLY due to modprobe errors and mount errors on 
 the client,
 and some known messages on the servers after the rpm installs.
  
 My only direct access to Oracle Lustre downloads is through another person 
 with an Oracle ID who's not very willing to help - i.e. this route is painful
  
 So to explain why I'm stuck:
  
 a) access to oracle downloads is not easy
 b) there is so much risk with altering kernels, given all the applications 
 and stability of the environment you could literally trash the server and 
 spend days recovering - in addition to it being the main storage / resource 
 for research
 c) I can't seem to find after looking Lustre RPMs that match my kernel 
 environment specifically, i.e. the SLES 11 AND CENTOS 5.3
 d) I've never created rpms to a specific kernel version and that would be a 
 deep dive into new territory and frankly another gamble
  
 What's the least painful and least risky to get Lustre working in this 
 prototype which will then lend to production (equally least painful) given 
 these statements - Help !
 Cliff, I could use some details on how specifically Whamcloud can fit this 
 scenario - and thanks for all the enlightenment.
  
  
 thanks for your help
 Charles 



Re: [Lustre-discuss] OSS Nodes Fencing issue in HPC

2012-01-30 Thread Kevin Van Maren
As I replied earlier, those slow messages are often a result of memory 
allocations taking a long time.  Since zone_reclaim shows up in many of the 
stack traces, that still appears to be a good candidate.

Did you check /proc/sys/vm/zone_reclaim_mode and was it 0?  Did you change it 
to 0 and still have problems?
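
For reference, checking and changing it is just the following (a minimal sketch; 
persisting it in /etc/sysctl.conf is an assumption about how you manage the 
nodes):

  cat /proc/sys/vm/zone_reclaim_mode        # 0 means zone reclaim is disabled
  echo 0 > /proc/sys/vm/zone_reclaim_mode   # takes effect immediately
  # to persist across reboots, add to /etc/sysctl.conf:
  #   vm.zone_reclaim_mode = 0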

The same situation that causes the Lustre threads to be slow can also stall the 
heartbeat processes.  Did you increase the heartbeat deadtime timeout value?

Kevin


On Jan 27, 2012, at 1:42 AM, VIJESH EK wrote:

Dear Sir,

I have attached the /var/log/messages from the OSS node ,
Please go through the logs and kindly give me a solution for this issue

Thanks & Regards

VIJESH E K
HCL Infosystems Ltd.
Chennai-6
Mob:+91 99400 96543


On Mon, Jan 23, 2012 at 12:03 PM, VIJESH EK <ekvij...@gmail.com> wrote:
Hi,

I hope all of you are in good spirits.

We have four OSS servers; OSS1 to OSS4 are clustered with each other.
The nodes are clustered as OSS1 and OSS2, and OSS3 & OSS4.
It was configured six months back, and from the beginning it has had an issue 
where one node fences the other node, which then goes into a shutdown state.
This happens every two to three weeks.
/var/log/messages continuously shows errors like
 "slow start_page_write 57s due to heavy IO load"
Can anybody help me with this issue?


Thanks & Regards

VIJESH E K








Re: [Lustre-discuss] How to write more databytes on a file if ost is full

2012-01-26 Thread Kevin Van Maren
Yes, this is the expected behavior.  Lustre is (still) unable to change the 
static stripe information after a file is created, so once a file is allocated 
on an OST, if that OST becomes full, Lustre will not be able to grow the file 
regardless of the space available on other OSTs.

The workaround for this issue is to cp the file to a temporary name on the 
same file system, where it is likely to be allocated on the new OST with free 
space, and then rename the new file over the old one.  Now repeat until you 
have achieved the desired balance of free space.

lfs_migrate is a tool that automates this process somewhat.  See 
http://wiki.lustre.org/manual/LustreManual20_HTML/UserUtilities_HTML.html#50438206_42260
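
A hedged sketch of both approaches (paths and the OST UUID are placeholders, and 
neither method is safe on files that are actively being written):

  # manual copy-and-rename: the copy gets a fresh layout on the less-full OSTs
  cp -a /mnt/lustre/bigfile /mnt/lustre/bigfile.tmp
  mv /mnt/lustre/bigfile.tmp /mnt/lustre/bigfile
  # or let the helper script handle a batch of files on the full OST
  lfs find /mnt/lustre --obd <OST_UUID> -type f | lfs_migrate -y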

Kevin


On Jan 26, 2012, at 7:09 AM, Eudes wrote:

 Hello,
 
 I use lustre 1.8.5 on Debian.
 
 On Lustre 1.8.5, suppose I have:
 
 - one OST with 1 TB, of which 100 MB is free
 
 and I add a new OST with 1 TB.
 
 On the first OST, I want to append new data bytes to a file (fseek to the end) 
 and add 500 MB; it fails because Lustre can't write the remaining 400 MB to the 
 new OST.
 
 So my questions are:
 - Is there a solution in Lustre (2.0?)
 - Do other cluster file systems have this problem?
 
 
 Thanks
 


Re: [Lustre-discuss] OSS Nodes Fencing issue in HPC

2012-01-22 Thread Kevin Van Maren
Well, it sounds like an issue with your HA package configuration.  Likely one 
node is not being responsive enough to heartbeat/are-you-alive messages so the 
other node assumes it has died.  This is likely fixed by increasing the 
deadtime parameter in your HA configuration (try 180 seconds if it is smaller 
than that).  Hard to say, as you omitted any logs, and you didn't even say what 
HA package you are using.

You also didn't indicate which Lustre version you are using.  One of the likely 
candidates for those messages is the kernel having difficulty allocating 
memory.  On many kernels, if /proc/sys/vm/zone_reclaim_mode is not 0, memory 
allocations can take a long time as it keeps looking for the best pages to free 
until pages in the local NUMA node are available.   With the Lustre 1.8.x write 
cache, the memory pressure is substantial (in 1.6.x and earlier, the service 
threads had statically-allocated buffers, but starting with 1.8.x each incoming 
request allocates new pages and frees them back to the page cache).

Kevin


On Jan 22, 2012, at 11:33 PM, VIJESH EK wrote:

Hi,

I hope all of you are in good spirits.

We have four OSS servers; OSS1 to OSS4 are clustered with each other.
The nodes are clustered as OSS1 and OSS2, and OSS3 & OSS4.
It was configured six months back, and from the beginning it has had an issue 
where one node fences the other node, which then goes into a shutdown state.
This happens every two to three weeks.
/var/log/messages continuously shows errors like
 "slow start_page_write 57s due to heavy IO load"
Can anybody help me with this issue?


Thanks & Regards

VIJESH E K




Re: [Lustre-discuss] Lustre 1.8.7 kernel patches for SLES11

2011-12-21 Thread Kevin Van Maren
I don't know why it would have been removed.  I find the sd_iostats very useful.

It provides stats for any sd disk: SCSI, SAS, or SATA in SCSI-emulation mode 
(i.e., no if your drives show up as IDE /dev/hd*, but yes if they show up as 
/dev/sd*).

Kevin


On Dec 21, 2011, at 9:46 AM, Charland, Denis wrote:


 Any good reason why sd_iostats-2.6.32-vanilla.patch has been removed
 from lustre/kernel_patches/series/2.6-sles11.series in Lustre 1.8.7?

I found that it has been removed as part of “b=23988 Remove sd iostats patch 
from sles11 patch series”.

I’m using this patch series to patch kernel 2.6.32.19-163 in Fedora12. Should I 
avoid applying this patch
when building the patched kernel?

Does this patch apply to SCSI disks only or does it apply to other type of 
disks (SAS/SATA) too?

Denis Charland
UNIX Systems Administrator
National Research Council Canada


Re: [Lustre-discuss] Are there recommended CPUs for Lustre servers?

2011-12-07 Thread Kevin Van Maren
Not sure how much this has improved with the cpu-scaling work in 2.x, but in 
general faster processors are much better than more processors.

Peak performance has been in the range of 4-8 cores, with performance dropping 
after that due to lock contention.  12 cores/node should still be fine, but 
certainly a faster-per-core quad core is likely preferable to a hex-core CPU.

OSS nodes mostly need a good IO/memory subsystem.  Bull used some large NUMA 
machines, but there are additional complications when using, e.g., multiple IB 
HCAs for performance, so generally the 2-socket range is optimal.

Kevin


On Dec 6, 2011, at 11:57 PM, Oleg Drokin wrote:

 Hello!
 
 On Dec 6, 2011, at 3:44 PM, Sebastian Gutierrez wrote:
 Is there any recommendations on whether or not to use 6core Intel CPUs for 
 the Lustre OSS or MDS nodes?  
 
 While on MDs you do want to have as powerful machine as you can since there 
 is only one,
 I think cpu is not a bottleneck on OSSes.
 Of course you still can install faster CPUs on OSSes, but I think your money 
 would be better spent on memory instead.
 
 Bye,
Oleg
 --
 Oleg Drokin
 Senior Software Engineer
 Whamcloud, Inc.
 




Re: [Lustre-discuss] SNS Status

2011-12-07 Thread Kevin Van Maren
Non-existent.

On Dec 7, 2011, at 5:55 AM, Yuri wrote:

 Hi guys,
 
 Could someone please tell me what's the current status of SNS (in particular 
 RAID-1)?
 
 Thanks in advance.


Re: [Lustre-discuss] OST size limitation

2011-11-03 Thread Kevin Van Maren
On Nov 3, 2011, at 12:40 PM, Andreas Dilger wrote:
 
 Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
 flash :-), but the 512-byte sector offset added by the partition table will 
 cause all IO to be misaligned to the underlying device.


It is possible to align partition boundaries, but it is not the default.  
Partitions (if used) should normally be aligned to a multiple of the RAID 
stripe size, although note that some RAID controllers internally compensate for 
the expected misalignment.

See 
http://wikis.sun.com/display/Performance/Aligning+Flash+Modules+for+Optimal+Performance
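
If a partition table is needed at all, a hedged example of creating an aligned 
partition with a reasonably recent parted (the device name is a placeholder, and 
a 1MiB start offset covers most RAID chunk sizes):

  parted -s /dev/<ost_lun> mklabel gpt
  parted -s -a optimal /dev/<ost_lun> mkpart primary 1MiB 100%

Skipping the partition table entirely and formatting the whole LUN avoids the 
issue altogether.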


 Even with flash storage it is much better to align the IO on power-of-two
 boundaries, since the erase blocks cause extra latency if there are read-
 modify-write operations.


That also depends on the flash.  The Fusion-io products have no alignment 
issues.

Kevin




Re: [Lustre-discuss] OST size limitation

2011-11-02 Thread Kevin Van Maren
On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:

I read in the Lustre Operations Manual that there is an OST size limitation of 
16 TB on RHEL and
8 TB on other distributions because of the ext3 file system limitation. I have 
a few questions about that.

Why is the limitation 16 TB on RHEL?


16TB is the maximum size RedHat supports.  See 
http://www.redhat.com/rhel/compare/
Larger than that requires bigger changes.

Note that whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see 
http://jira.whamcloud.com/browse/LU-419 ).

Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for 128TB 
LUNs.



I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will 
be the OST size limitation?

What is the OST size limitation when using ext4?

16TB with the Lustre-patched RHEL kernel.


Is it preferable to use ext4 instead of ext3?

If the block device has more than 8 TB or 16 TB, it must be partitioned. Is 
there a performance degradation
when a device has multiple partitions compared to a single partition? In other 
words, is it better to have three
8 TB devices with one partition per device than to have one 24 TB device with 
three partitions?


Better to have 3 separate 8TB LUNs.  Different OSTs forcing the same drive 
heads to move to opposite parts of the disk does degrade performance (with a 
single OST moving the drive heads, the block allocator tries to minimize 
movement).



Denis Charland
UNIX Systems Administrator
National Research Council Canada





Re: [Lustre-discuss] Anybody have a client running on a 2.6.37 or later kernel?

2011-10-22 Thread Kevin Van Maren
Why not use the RHEL6 kernel on RHEL5?  That's probably much easier.

Kevin


On Oct 21, 2011, at 9:50 PM, Carlson, Timothy S timothy.carl...@pnnl.gov 
wrote:

 Folks,
 
 I've got a need to run a 2.6.37 or later kernel on client machines in order 
 to properly support AMD Interlagos CPUs. My other option is to switch from 
 RHEL 5.x to RHEL 6.x and use the whamcloud 1.8.6-wc1 patchless client (the 
 latest RHEL 6 kernel also supports Interlagos). But I would first like to 
 investigate using a 2.6.37 or later kernel on RHEL 5.
 
 I have a running kernel and started down the path of building Lustre against 
 2.6.37.6 and ran into the changes that have been made wrt to ioctl(), proc 
 structures, etc.  I am *not* a kernel programmer would rather not mess around 
 too much in the source. 
 
 So I am asking if anyone has successfully patched up Lustre to get a client 
 working with 2.6.37.6 or later.
 
 Thanks!
 
 Tim 


Re: [Lustre-discuss] MDS network traffic question

2011-10-12 Thread Kevin Van Maren
I would replace the 1GigE with the 10GigE: have all Ethernet traffic go over 
the 10GigE links, rather than add another tcp1 network for Lustre.  This will 
keep your configuration much simpler, and make the migration as painless as 
possible (just move the IP address to the 10GigE port on the servers).
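For reference, a minimal sketch of what the server-side LNET config could look like once the IP has moved (interface names and file location are assumptions, not from the original post):

  # /etc/modprobe.d/lustre.conf on the servers
  # tcp0 simply rides on whichever port carries the server IP; when that IP
  # moves from the 1GigE port (eth0) to the 10GigE port (eth2), only the
  # interface name changes here, and clients keep using tcp0 unchanged.
  options lnet networks="o2ib0(ib0),tcp0(eth2)"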

The MDS traffic _volume_ is much lower than it is for the OSS nodes.  The big 
win from 10GigE would be the lower latency: if you approach 100MB/s of MDS 
traffic, you have much bigger problems than a 10GigE NIC can solve.

Kevin


On Oct 11, 2011, at 3:54 PM, James Robnett wrote:

 
   We have a small lustre install consisting of an MDS and 5 OSS servers.
 Historically the MDS and OSS servers had both a 1Gbit ethernet interface
 (tcp0) to workstations and a QDR IB interface (ib0) to our cluster.
 
   We're planning on adding a MTU 9000 10Gbit ethernet (tcp1) interface
 to the MDS and OSS nodes and workstations for faster access.  Our
 software has a pretty high IO to CPU component.
 
   I just discovered that our MDS can't in fact take another PCIe 8x
 card but it does have a spare GigE port.  The 10gbit Ethernet switch
 can support 1gbit and 10gbit interfaces.
 
   We'd then have 3 networks
 tcp0 at 1gbit to slow clients
 tcp1 at 10gbit to faster clients
 ib0 to cluster
 
   My question is:
 
   Is there a risk of congestion or overrunning that 2nd GigE MDS 
 interface if our workstations and OSS servers communicate over tcp1 at
 10gbit but the MDS tcp1 is connected at 1Gbit.  The bulk of our traffic
 will continue to be between the cluster and lustre over IB but the
 workstations can trivially over run ethernet hence the desire for
 10gbit between them and the OSSes.
 
   My gut feeling is it should be fine, particularly with the larger MTU,
 there's not that much traffic to the MDS but I'd easily believe it if
 somebody said it's risky thing to do.
 
   The alternative is to buy a new MDS and swap disks into it.
 
 James Robnett
 National Radio Astronomy Observatory
 Array Operations Center
 




Re: [Lustre-discuss] Question about setting max service threads

2011-08-15 Thread Kevin Van Maren
Andreas answered the question asked, and did an excellent job.

But to answer the unasked question, will reducing the thread count  
really fix the problem:

This is often NOT caused by mere disk overload from too many service  
threads.  For example, one recent issue was tracked down to free space  
allocation times being quite large, due to free space bitmaps needing  
to be read from disk.  It has also been common for memory allocations  
to be the major time sink, as with Lustre 1.8 the service threads no  
longer reuse the buffer and have to allocate new memory on every  
request (numa zoned allocations were especially problematic;  
apparently the best pages to free have a tendency of being found on  
the wrong numa node, so it took a lot of time/work to free up space  
on the local numa node to allow the allocation to succeed).

Bug 23826 had patches to track service times better, which will help  
you see how much of an issue this really is.

See also Bug 22516, which strives to normalize server threads per OST,  
rather than per server.

Bug 22886 discusses issues with the elevator taking 1MB IOs and  
converting them into odd sizes, which depending on the array could  
also have an impact on IO.

Bug 23805 has some additional rambling along this line as well.
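As a quick sanity check on whether the service threads themselves are the issue, the thread counts and request stats can be read on the OSS (a sketch using 1.8.x parameter names, which may differ on other versions):

  # started vs. maximum ost_io service threads
  lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
  # per-service request statistics (service times show up here)
  lctl get_param ost.OSS.ost_io.stats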

Kevin


On Aug 15, 2011, at 6:36 PM, Andreas Dilger adil...@whamcloud.com  
wrote:

 On 2011-08-15, at 3:58 PM, Mike Hanby wrote:
 Our OSS servers are logging quite a few heavy IO load combined  
 with system load (via 'uptime') being reported in the 100's to  
 several 100's range.

 Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO  
 load
 Aug 15 13:00:38 lustre-oss-0-2 kernel: Lustre: Service thread pid  
 17651 completed after 236.04s. This indicates the system was  
 overloaded (too many service threads, or there were not enough  
 hardware resources).
 Lustre: Skipped 1 previous similar message
 Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO  
 load
 Lustre: Service thread pid 16436 completed after 210.17s. This  
 indicates the system was overloaded (too many service threads, or  
 there were not enough hardware resources).

 I'd like to test setting the ost_io.threads_max to values lower  
 than 512.

 Question 1: Will this command survive a reboot lctl set_param  
 ost.OSS.ost_io.threads_max=256

 This is only a temporary setting.

 or do I need to also run lctl conf_param  
 ost.OSS.ost_io.threads_max=256?

 The conf_param syntax is (unfortunately) slightly different than the  
 set_param syntax.  You can also set this in /etc/modprobe.d/ 
 lustre.conf:

 options ost oss_num_threads=256
 options mds mds_num_threads=256

 Question 2: Since Lustre does not reduce the number of service  
 threads in use, is there any way I can force the extra running  
 service threads to exit, or is a reboot of the OSS servers the only  
 clean way?

 I had written a patch to do this, but it wasn't landed yet.   
 Currently the only way to limit the thread count is to set this  
 before the number of running threads has exceeded the maximum thread  
 count.

 Cheers, Andreas
 --
 Andreas Dilger
 Principal Engineer
 Whamcloud, Inc.





Re: [Lustre-discuss] inconsistent client behavior when creating an empty directory

2011-08-09 Thread Kevin Van Maren
This appears to be the same issue as 
https://bugzilla.lustre.org/show_bug.cgi?id=23459

Kevin


Andrej Filipcic wrote:
 Hi,

 the following code does not work as expected:
 -
 #include <sys/stat.h>
 #include <errno.h>
 #include <stdio.h>

 int main(int argc, char** argv) {

   int rc;
   rc=mkdir(argv[1],S_IRWXU);
   if(rc) perror("failed create dir");
   chown(argv[1],4103,4100);

   struct stat buf;
   /* stat(argv[1],&buf); */

   setresuid(0,4103,4100);
   rc=mkdir(argv[1],S_IRWXU);
   if(rc) perror("failed create dir as user");
 }
 -

 initial status:

 # ls -ld /lustre/test
 drwxr-xr-x 2 root root 4096 Aug  9 14:59 /lustre/test
 # ls -l /lustre/test
 total 0

 1) running the test program:

 # /tmp/test /lustre/test/testdir
 failed create dir as user: Permission denied
 # ls -l /lustre/test
 total 4
 drwx-- 2 griduser03 grid 4096 Aug  9 15:02 testdir

 griduser03, grid correspond to uid=4103,gid=4100


 2) running the test program, but with uncommented stat call:
 # /tmp/test /lustre/test/testdir
 failed create dir as user: File exists
 # ls -l /lustre/test
 total 4
 drwx-- 2 griduser03 grid 4096 Aug  9 15:04 testdir


 The code first makes the testdir as root and changes the ownership to uid 
 4103. 
 Then it tries to (re)create the same dir with the user privileges. 

 If stat is called, the code behaves as expected (case 2), but if not (case 
 1), the second mkdir should return EEXIST and not EACCES. Is this behavior 
 expected or is it a client bug? The client runs lustre 1.8.6.

 The code just illustrates, what is actually used in a complex software.

 Andrej

   



Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-09 Thread Kevin Van Maren
chas williams - CONTRACTOR wrote:
 On Mon, 08 Aug 2011 12:03:25 -0400
 chas williams - CONTRACTOR c...@cmf.nrl.navy.mil wrote:

   
 later mdc_exit_request() finds this mcw by iterating the list.
 seeing as mcw was allocated on the stack, i dont think you can do this.
 mcw might have been reused by the time mdc_exit_request() gets around
 to removing it.
 

 nevermind. i see this has been fixed in later releases apparently (i
 was looking at 1.8.5). if l_wait_event() returns early (like
 from being interrupted) mdc_enter_request() does the cleanup itself now.
   

That code is unchanged in 1.8.6.

Kevin



Re: [Lustre-discuss] Moving storage from one OSS to another

2011-08-08 Thread Kevin Van Maren
Rafa Griman wrote:
 Hi all :)

 Got a customer

(It is quite bad form to ask for help for a commercial deal without 
using your work email address.  I assume you still work for Bull?)

 with:
  - 1 x S2A9900 (one couplet)
  - 500 x 2 TB drives
  - 4 OSS

 This customer wants to add:
  - 1 x S2A9900 (one couplet)
  - 300 x 2 TB drives
  - 4 OSS

 I know we could just add and restripe the existing files. But the
 customer wants to physically move 100 drives from the existing S2A9900
 to the new one in order to have 400 drives on one S2A9900 and 400
 drives on the other one.

 So my questions are:
  1.- Can this be done (move drives between OSTs) without losing data?
   

First, make sure DDN supports moving drives _data intact_.  The drives 
in each tier will at least have to end up as an ordered tier for this to 
work, but I would guess it isn't quite so simple.

  2.- Can an OST be moved from one OSS to another without losing data?
   

Yes, IF it is done properly.

Step 1, back up all essential data.

Step 2, add the _new server_ as a failover node for the OSTs being 
moved.  Note that this requires a tunefs.lustre --writeconf (with new 
parameters on the OSTs being moved) and re-mounting all the servers.

Step 3, move the drives

Step 4, update the server/failover NIDs, removing the old ones with 
another writeconf pass.
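For illustration only (NIDs, device names and any extra parameters below are made up; --erase-params wipes previously stored settings, so they all have to be re-specified), steps 2 and 4 would look roughly like:

  # step 2: on each OST being moved, add the new OSS as a failover NID
  tunefs.lustre --failnode=192.168.1.22@o2ib --writeconf /dev/sdc

  # step 4: after the physical move, rewrite the NID list without the old server
  tunefs.lustre --erase-params --mgsnode=192.168.1.10@o2ib \
      --failnode=192.168.1.21@o2ib --writeconf /dev/sdc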


I _strongly_ recommend that you verify their Lustre support contract is 
current, and that you do a dry-run in a test environment before doing it 
live (if nothing else, you do have 4 servers and a new couplet to play 
with).

  3.- Has anyone done this? How?

 I imagine it can be done restriping files/migrating files within the
 Lustre filesystem, removing empty OSTs, ...
   

Well, yes, there is that option as well; use lfs_migrate (at least 2 
passes).
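Roughly, the migration route looks like this (device number, OST name and mount point are placeholders; check the lfs_migrate man page for the exact options in your release):

  # on the MDS: stop new object allocations on the OST being drained
  # (the device number comes from 'lctl dl')
  lctl --device 14 deactivate
  # on a client: rewrite every file that has objects on that OST
  lfs find --obd lustre-OST0002_UUID /mnt/lustre | lfs_migrate -y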

 TIA

Rafa
   

Kevin




Re: [Lustre-discuss] Random OST Numbers chosen in a stripe

2011-08-01 Thread Kevin Van Maren
Johann Lombardi wrote:
 On Fri, Jul 29, 2011 at 04:49:28PM -0400, Roger Spellman wrote:
   
 For a different file:

     obdidx     objid     objid    group
         13      6884    0x1ae4        0
         28      6880    0x1ae0        0
         44      6880    0x1ae0        0
         27      6880    0x1ae0        0

 Why is this?

 How can I control it to always be sequential?
 

 It depends on the OST usage imbalance and you can tune the stripe allocation 
 policy with qos_threshold_rr. For more information, please refer to the 
 lustre manual:
 http://wiki.lustre.org/manual/LustreManual20_HTML/LustreProc.html#50438271_pgfId-1296529

 Cheers,
 Johann
   

Also note that newer versions of Lustre sort the OST list even in RR 
mode, so that it will not allocate successive objects from the same OSS 
node.
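A sketch of how to inspect/adjust that behaviour on the MDS (1.8.x proc names; the lov device name depends on the fsname, and the value is a percentage):

  # how imbalanced the OSTs must be before QOS overrides round-robin
  lctl get_param lov.*-mdtlov.qos_threshold_rr
  # e.g. keep allocation round-robin until the imbalance is large
  lctl set_param lov.*-mdtlov.qos_threshold_rr=90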

Kevin



Re: [Lustre-discuss] Failover / reliability using SAD direct-attached storage

2011-07-24 Thread Kevin Van Maren
Mark Hahn wrote:
 It seems an external fibre
 or SAS raid is needed,
 

 to be precise, a redundant-path SAN is needed.  you could do it with 
 commodity disks and Gb, or you can spend almost unlimited amounts on 
 gold-plated disks, FC switches, etc.
   
Many deployments are done without redundant paths, which offer 
additional insurance.

 the range of costs is really quite remarkable, I guess O(100x). 
 compare this to cars where even VERY nice production cars are only 
 a few times more expensive than the most cost-effective ones.
   

You're comparing two mass-market cars: there is a nearly 1000x 
difference in price
between a cheap dune buggy and a Bugatti, but both provide 
transportation for 1-2 people.

 as the idea of loosing the file system if one
 node goes down doesn't seem good, even if temporary.
 

The clients should just hang on the file system until the server is 
again available.
This is not so different from using NFS with hard mounts.

Note that even with failover, the Lustre file system will be down for 
several
minutes, as the HA package has to first detect a problem, and then 
safely startup
Lustre on the backup server, and then Lustre recovery has to occur.

 how often do you expect nodes to fail, and why?

 regards, mark hahn.
   




Re: [Lustre-discuss] Failover / reliability using SAD direct-attached storage

2011-07-22 Thread Kevin Van Maren
Tyler Hawes wrote:
 Apologies if this is a bit newbie, but I'm just getting started, 
 really. I'm still in design / testing stage and looking to wrap my 
 head around a few things.

 I'm most familiar with Fibre Channel storage. As I understand it, you 
 configure a pair of OSS per OST, one actively serving it, the other 
 passively waiting in case the primary OSS fails. Please correct me if 
 I'm wrong...
No, that's basically it.  Lustre works well with FC storage, although a 
full SAN configuration (redundant switch fabrics) is not often used: 
with only 2 servers needing access to each LUN, and bandwidth to storage 
being key, servers are most often directly attached to the FC storage, 
with multiple paths to handle controller/path failure and improve BW.

But to clarify one point, Lustre is not waiting passively on the backup 
server.  Lustre can only be active on one server for a given OST at a 
time.  Some high-availability package, external to Lustre, is 
responsible for ensuring Lustre is active on one server (the OST is 
mounted on one server).  Heartbeat was quite popular, but more people 
have been moving to the more modern packages like Pacemaker.  It is left 
to the HA package to perform failover as necessary, even though most HA 
packages do not perform failover by default if the network or back-end 
storage link goes down (which is where bonded networks and storage 
multipath could come in).
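For illustration only, a bare-bones Pacemaker resource for mounting one OST with the stock Filesystem agent (device path, mount point and crm-shell syntax are assumptions; a real deployment also needs STONITH/fencing and location constraints for the two servers):

  crm configure primitive resOST0000 ocf:heartbeat:Filesystem \
      params device="/dev/mapper/ost0000" directory="/mnt/lustre/ost0000" \
             fstype="lustre" \
      op monitor interval="120s" timeout="60s" \
      op start timeout="300s" op stop timeout="300s"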

 With SAS/SATA direct-attached storage (DAS), though, it's a little 
 less clear to me. With SATA, I imagine that if an OSS goes down, all 
 its OSTs go down with it (whether they be internal or external 
 mounted drives), since there is no multipathing. Also, I suppose I'd 
 want a hardware RAID controller PCIe card, which would also preclude 
 failover since it's not going to have cache and configuration mirrored 
 in another OSS's RAID card.

Normally, yes.  Sun shipped quite a bit of Lustre storage with failover 
using SATA in external enclosures (J4400), but that was special in that 
there were (2) SAS expanders per enclosure, and each drive was connected 
to a SATA MUX to allow both servers access to the SATA drives.

I am glad you understand the hazards of connecting two servers using 
internal raid controllers with external storage.  Until a RAID card is 
developed specifically designed with that in mind (and strictly uses a 
write-though cache), it is a very bad idea.  [For others, please 
consider what would happen to the file system if the raid card has a 
battery backed cache with a bunch of pending writes to get replayed at 
some point _after_ the other server completes recovery.]

If you are using a SAS-attached external RAID enclosure, then it is not 
much different than using a FC-attached RAID.  Ie, the direct-attached 
ST2530 (SAS) can be used in place of a direct-attached ST2540 (FC) with 
the only architecture change being the use of a SAS card/cables instead 
of an FC card/cables.  The big difference between SAS and FC is that 
people are not (yet) building SAS-based SANs.  Already many FC arrays 
have moved to SAS drives on the back end.
http://www.oracle.com/us/products/servers-storage/storage/disk-storage/sun-storage-2500-m2-array-407918.html

 With SAS, there seems to be a new way of doing this that I'm just 
 starting to learn about, but is a bit fuzzy still to me. I see that 
 with things like Storage Bridge Bay storage servers from the likes of 
 Supermicro, there is a method of putting two server motherboards in 
 one enclosure, having an internal 10GigE link between them to keep 
 cache coherency, some sort of software layer to manage that (?), and 
 then you can use inexpensive SAS drives internally and through 
 external JBOD chassis. Is anyone using something like this with Lustre?

Some people have used (or at least toyed with using) DRBD and Lustre, 
but I would not say it is fast, recommended, or a mainstream Lustre 
configuration.  But that is one way to replicate internal storage across 
servers, to allow Lustre failover.

With SAS drives in an external enclosure, it is possible to configure 
shared storage for use with Lustre, although if you are using a JBOD 
rather than a raid controller, there are the normal issues (Linux SW 
raid/LVM layers are not clustered, so you have to ensure they are only 
active on one node at a time).

 Or perhaps I'm not seeing the forest through the trees and Lustre has 
 software features built-in that negate the need for this (such as 
 parity of objects at the server level, so you can loose N+1 OSS)? 
 Bottom line, what I'm after is figuring out what architecture works 
 with inexpensive internal and/or JBOD SAS storage that won't risk data 
 loss with the failure of a single drive or server RAID array...

Lustre does not support redundancy in the file system.  All data 
availability is through RAID protection, combined with server failover.

With internal storage, you lose the failover part.  Sun also delivered 
quite a bit of 

Re: [Lustre-discuss] multipathd or sun rdac driver?

2011-07-20 Thread Kevin Van Maren
David Noriega wrote:
 We already use multipathd in our install already, but this was
 something I wondered about. We use Sun disk arrays and they mention
 the use of their RDAC driver to multipathing on Linux. Since its from
 the vendor, one would think it be better. What does the collective
 think?

 Sun StorageTek RDAC Multipath Failover Driver for Linux
 http://download.oracle.com/docs/cd/E19373-01/820-4738-13/chapsing.html

 David
   

I assume you are using the ST25xx or ST6xxx storage with Lustre?  
Exactly which arrays?

I've been happy with RDAC, but I don't think Oracle has released RHEL6 
support yet
(but Oracle also does not support Lustre servers on RHEL6 yet).

If your multipath config is working (ie, you've tested it by 
unplugging/replugging cables
under load and were happy with the behavior), I'm not going to tell you 
to change.
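If you do stay with dm-multipath, the important part is that the array is 
active/passive, so the config should use the rdac checker/priority rather 
than round-robin across both controllers. A rough multipath.conf sketch 
only -- the vendor/product strings and the prio callout vary by array 
model, firmware and distro, so treat every value here as an assumption to 
verify:

  devices {
      device {
          vendor                  "SUN"
          product                 "LCSM100_*"
          hardware_handler        "1 rdac"
          path_grouping_policy    group_by_prio
          prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
          path_checker            rdac
          failback                immediate
          no_path_retry           30
      }
  }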

Kevin



Re: [Lustre-discuss] multipathd or sun rdac driver?

2011-07-20 Thread Kevin Van Maren
Yes, the controllers are active/passive, so while both controllers 
export each LUN,
only the LUN on the active controller can be used.  In the event of a 
path or controller
failure, RDAC will migrate the lun so that it is active on the working 
controller/path.

Seeing those problems either indicates that your multipath driver 
doesn't properly
support asynchronous multipath, or there is a configuration issue.  I 
believe some
firmware versions allow you to have automatic failover, so the LUN is 
migrated on
access, which was meant to work around multipath drivers that didn't 
migrate the
LUN, but will perform very poorly if more than one path is used.

Note that it is also possible to have multiple paths to each controller, 
which can also
be load balanced or zoned (more useful for eg the ST6780).

[If you want to experience pain, access a LUN from two hosts at the same 
time,
which each host connected to a different controller.  It will also work, 
but be slow,
kindof like reading two CDs at the same time in a CD changer.]

Kevin


David Noriega wrote:
 They are 2540 and I'm running EL5(centos).

 Well the thought came around since I had to rebuild a node after a
 hardware problem. So I went ahead and gave it a shot. I think I posted
 about this problem before somewhere in the mailing list about getting
 stray I/O errors which were for /dev/sdX devices that were the other
 path to the same device(Well thats the idea we came to). Well after
 installing the Sun RDAC module and disabling multipathd, I can happily
 say those messages are gone, so I suppose Sun's module is able to talk
 to the disk array in a better manner then multipathd. Though I haven't
 failed back the lustre ost's to this particular node just yet(will
 wait till the weekend). I'll post again if anything goes wrong, but I
 think going with this RDAC module might be better.

 ps: One thing that has nagged me since Lustre was installed and setup
 by a vendor, was the disk arrays were never setup with initiators or
 hosts in the configuration(Using CAM). We have another similar disk
 array(6140) we setup for another filesystem and I know
 initiators/hosts were setup on the array. I can't say that this has
 caused any problems, but its something in the back of my mind.

 Thanks,
 David

 On Wed, Jul 20, 2011 at 4:15 PM, Kevin Van Maren
 kevin.van.ma...@oracle.com wrote:
   
 David Noriega wrote:
 
 We already use multipathd in our install already, but this was
 something I wondered about. We use Sun disk arrays and they mention
 the use of their RDAC driver to multipathing on Linux. Since its from
 the vendor, one would think it be better. What does the collective
 think?

 Sun StorageTek RDAC Multipath Failover Driver for Linux
 http://download.oracle.com/docs/cd/E19373-01/820-4738-13/chapsing.html

 David

   
 I assume you are using the ST25xx or ST6xxx storage with Lustre?  Exactly
 which arrays?

 I've been happy with RDAC, but I don't think Oracle has released RHEL6
 support yet
 (but Oracle also does not support Lustre servers on RHEL6 yet).

  If your multipath config is working (ie, you've tested it by
 unplugging/replugging cables
 under load and were happy with the behavior), I'm not going to tell you to
 change.

 Kevin


 



   



Re: [Lustre-discuss] how to baseline the performance of a Lustre cluster?

2011-07-18 Thread Kevin Van Maren
Tim Carlson wrote:
 On Fri, 15 Jul 2011, Theodore Omtzigt wrote:

   
 To me it looks very disappointing as we can get 3GB/s from the RAID
 controller aggregating a collection of raw SAS drives on the OSTs, and
 we should be able to get a peak of -5GB/s from QDR IB.

 First question: is this baseline reasonable?
 

 For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving 
 real data. 40Gb/s is the signaling rate and you need to factor in the PCI 
 bus 8/10 encoding. So your 40Gb/s becomes 32Gb/s right off the bat. 

Yes, the (unidirectional) bandwidth of QDR 4x IB is 4GB/s, including 
headers, due to the
InfiniBand 8b/10b encoding.  This is the same (raw) data rate as PCIe 
gen2 x8 (which also
uses 8b/10b encoding, to transmit 10bits for every 8-bit byte).

Interestingly, the upcoming InfiniBand FDR moves to 64b/66b encoding, 
which eliminates most
of the link overhead.  [8b/10b encoding exists 1) to ensure there are an 
equal number of 1 and 0 bits, and 2) to set an upper bound on the number 
of sequential 1 or 0 bits at 
a small number.  With
64b/66b there can now be something like 65bits in a row with the same 
value, which makes
it more susceptible to clock skew issues, although the claim is that in 
practice the number
of bits is much smaller as a scrambler is used to randomize the actual 
bits, and the sequences
that correspond to 64 1's or 64 0's will never be used.  So the 
wrong data pattern could
cause more problems.]

To clarify, this 4GB/s is reduced to around 3.2GB/s of data primarily 
due to the smaller packet size
of PCIe (256Bytes), where the headers consume quite a bit of the BW, or 
somewhat less when using
128byte PCIe packets.


While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet 
get that high.  As I recall,
something ~2.5GB/s is more typical.


 Now 
 try and move some data with something like mpi_send and you will see that 
 the real amount of data you can send is really more like 24Gb/s or 3GB/s.

 The test size for ost_survey is pretty small. 30MB. You can increase that 
 with the -s flag. Try at least 100MB.

 You should also turn of checksums to test raw performance. There is an 
 lctl conf_param to do this, but the quick and dirty route on the client is 
 the following bash:

 for OST in /proc/fs/lustre/osc/*/checksums
 do
 echo 0 > $OST
 done

 For comparison sake, on my latest QDR connected Lustre file system with 
 LSI 9285-8e controllers connected to JBODs of slowing disks in 11 disk 
 RAID 6 stripes, I get around 500MB/s write and 350MB/s read using 
 ost-survey with 100MB data chunks.

 Your numbers seem reasonable.


 Tim
   


Theodore,

You have jumped straight to testing Lustre over the network, without 
first providing
performance numbers for the disks when locally attached.  (You also 
didn't test the
network, but in the absence of bad links GigE and IB are less variable 
and well understood.)

As for the disk performance, were you able to measure 3GB/s from the 
raid controller, or
what is that number based on?  What was the performance of an individual 
lun (or whatever
backs your OST)?  Are all the OSTs on a single server, and you are testing
them one at a time?

You should be able to get 100+MB/s over GigE, although you may need 2 
OSTs to
do that, and larger IO sizes.  Similarly, if you access multiple OSTs 
simultaneously,
you should be > 2GB/s over o2ib.  At least I am assuming you are using 
o2ib and not
just using tcp over InfiniBand, which would be slower.

Kevin



Re: [Lustre-discuss] how to add force_over_8tb to MDS

2011-07-14 Thread Kevin Van Maren
With one other note: you should have used --mkfsoptions='-t ext4' when 
doing mkfs.lustre, and NOT the force option.
Given that it is already formatted and you don't want to use data, at 
least use the ext4 Lustre RPMs.

Pretty sure you don't need a --writeconf -- you would either run as-is 
with ext4-based ldiskfs or reformat.

The MDT device should be limited to 8TB; I don't think anyone has tested 
a larger MDT.
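For anyone reformatting, a sketch of the OST format command with the ext4-based ldiskfs RPMs installed (fsname, NID and device are placeholders):

  # with the *.ext4.rpm server packages, a >8TB OST needs no force option
  mkfs.lustre --reformat --ost --fsname=testfs \
      --mgsnode=192.168.1.10@o2ib --mkfsoptions='-t ext4' /dev/mapper/ost0000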

Kevin


Cliff White wrote:
 This error message you are seeing is what Andreas was talking about - 
 you must use the ext4-based version, which does not need any option for 
 LUNs of your size. The 'must use force_over_8tb' error is the key here; 
 you most certainly want/need the *.ext4.rpm versions of the packages. 
 cliffw


 On Thu, Jul 14, 2011 at 11:10 AM, Theodore Omtzigt 
 t...@stillwater-sc.com wrote:

 Michael:

The reason I had to do it on the OST's is because when issuing the
 mkfs.lustre command to build the OST it would error out with the
 message
 that I should use the force_over_8tb mount option. I was not able to
 create an OST on that device without the force_over_8tb option.

 Your insights on the writeconf are excellent: good to know that
 writeconf is solid. Thank you.

 Theo

 On 7/14/2011 1:29 PM, Michael Barnes wrote:
  On Jul 14, 2011, at 1:15 PM, Theodore Omtzigt wrote:
 
  Two part question:
  1- do I need to set that parameter on the MGS/MDS server as well
  No, they are different filesystems.  You shouldn't need to do
 this on the OSTs either.  You must be using an older lustre release.
 
  2- if yes, how do I properly add this parameter on this running
 Lustre
  file system (100TB on 9 storage servers)
  covered
 
  I can't resolve the ambiguity in the documentation as I can't
 find a
  good explanation of the configuration log mechanism that is being
  referenced in the man pages. The fact that the doc for --writeconf
  states This is very dangerous, I am hesitant to pull the
 trigger as
  there is 60TB of data on this file system that I rather not lose.
  I've had no issues with writeconf.  Its nice because it shows
 you the old and new parameters.  Make sure that the changes that
 you made were the what you want, and that the old parameters that
 you want to keep are still in tact.  I don't remember the exact
 circumstances, but I've found settings were lost when doing a
 writeconf, and I had to explictly put these settings in
 tunefs.lustre command to preserve them.
 
  -mb
 
  --
  +---
  | Michael Barnes
  |
  | Thomas Jefferson National Accelerator Facility
  | Scientific Computing Group
  | 12000 Jefferson Ave.
  | Newport News, VA 23606
  | (757) 269-7634
  +---
 
 
 
 
 




 -- 
 cliffw
 Support Guy
 WhamCloud, Inc. 
 www.whamcloud.com


 



Re: [Lustre-discuss] inode tuning on shared mdt/mgs

2011-07-02 Thread Kevin Van Maren
Andreas Dilger wrote:
 On 2011-07-01, at 12:03 PM, Aaron Everett aever...@forteds.com wrote:
 I'm trying to increase the number of inodes available on our shared 
 mdt/mgs. I've tried reformatting using the following:

  mkfs.lustre --fsname fdfs --mdt --mgs --mkfsoptions=-i 2048 
 --reformat /dev/sdb

 The number of inodes actually decreased when I specified -i 2048 vs. 
 leaving the number at default. 

 This is a bit of an anomaly in how 1.8 reports the inode count. You 
 actually do have more inodes on the MDS, but because the MDS might 
 need to use an external block to store the striping layout, it limits 
 the returned inode count to the worst case usage. As the filesystem 
 fills and these external blck

[trying to complete his sentence:]
are not used, the free inode count keeps reporting the same number of 
free inodes, as the number of used inodes goes up.

It is pretty weird, but it was doing the same thing in v1.6

 We have a large number of smaller files, and we're nearing our inode 
 limit on our mdt/mgs. I'm trying to find a solution before simply 
 expanding the RAID on the server. Since there is plenty of disk 
 space, changing the bytes per inode seemed like a simple solution. 

 From the docs:

 Alternately, if you are specifying an absolute number of inodes, use 
 the-N number of inodes option. You should not specify the -i option 
 with an inode ratio below one inode per 1024 bytes in order to avoid 
 unintentional mistakes. Instead, use the -N option.

 What is the format of the -N flag, and how should I calculate the 
 number to use? Thanks for your help!

 Aaron




Re: [Lustre-discuss] HW RAID - fragmented I/O

2011-06-10 Thread Kevin Van Maren
It's possible there is another issue, but are you sure you (or RedHat) 
are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config, which is 
preventing it from being set to 256?  I don't have a machine using this 
driver.

You could put #warning in the code to see if you hit the non-256 code 
path when building, or printk the max_sgl_entries in 
_base_allocate_memory_pools.
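Independent of the driver, the obdfilter brw_stats show directly what I/O sizes are reaching the disk; a quick check (1.8.x paths, sdX is a placeholder):

  # per-OST histogram of disk I/O sizes actually issued
  lctl get_param obdfilter.*.brw_stats | grep -A 12 "disk I/O size"
  # block-layer limits currently in effect for the backing device
  cat /sys/block/sdX/queue/max_sectors_kb /sys/block/sdX/queue/max_hw_sectors_kb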

Kevin



Wojciech Turek wrote:
 Hi Kevin,

 Thanks for very helpful answer. I tried your suggestion and recompiled 
 the mpt2sas driver with the following changes:

 --- mpt2sas_base.h  2010-01-16 20:57:30.0 +
 +++ new_mpt2sas_base.h  2011-06-10 12:53:35.0 +0100
 @@ -83,13 +83,13 @@
 #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
 #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
 #define MPT2SAS_SG_DEPTH   16
 -#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
 -#define MPT2SAS_SG_DEPTH   128
 +#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
 +#define MPT2SAS_SG_DEPTH   256
 #else
 #define MPT2SAS_SG_DEPTH   CONFIG_SCSI_MPT2SAS_MAX_SGE
 #endif
 #else
 -#define MPT2SAS_SG_DEPTH   128 /* MAX_HW_SEGMENTS */
 +#define MPT2SAS_SG_DEPTH   256 /* MAX_HW_SEGMENTS */
 #endif

 #if defined(TARGET_MODE)

 However I can still see that almost 50% of writes and slightly over 50% of 
 reads fall under 512K I/Os
 I am using device-mapper-multipath to manage active/passive paths do 
 you think that could have something to do with the I/O fragmentation?

 Best regards,

 Wojciech

 On 8 June 2011 17:30, Kevin Van Maren kevin.van.ma...@oracle.com wrote:

 Yep, with 1.8.5 the problem is most likely in the (mpt2sas)
 driver, not in the rest of the kernel.  Driver limits are not
 normally noticed by (non-Lustre) people, because the default
 kernel limits IO to 512KB.

 May want to see Bug 22850 for the changes required eg, for the
 Emulex/lpfc driver.

 Glancing at the stock RHEL5 kernel, it looks like the issue is
 MPT2SAS_SG_DEPTH, which is limited to 128.  This appears to be set
 to match the default kernel limit, but it is possible there is
 also a driver/HW limit.  You should be able to increase that to
 256 and see if it works...


 Also note that the size buckets are power-of-2, so a 1MB entry
 is any IO > 512KB and <= 1MB.

 If you can't get the driver to reliably do full 1MB IOs, change to
 a 64KB chunk and set max_sectors_kb to 512.  This will help ensure
 you get aligned, full-stripe writes.

 Kevin



 Wojciech Turek wrote:

 I am setting up a new lustre filesystem using LSI engenio
 based disk
 enclosures with integrated dual RAID controllers. I configured
 disks
 into 8+2 RAID6 groups using 128kb segment size (chunk size). This
 hardware uses mpt2sas kernel module on the Linux host side. I
 use the
 whole block device for an OST (to avoid any alignment issues).
 When
 running sgpdd-survey I can see high throughput numbers (~3GB/s write,
 5GB/s read). Also controller stats show that number of IOPS =
 number
 of MB/s. However as soon as I put ldiskfs on the OSTs,
 obdfilter shows
 slower results (~2GB/s write , ~2GB/s read ) and controller
 stats show
 more then double IOPS than MB/s. Looking at output from iostat
 -m -x 1
 and brw_stats I can see that a large number of I/O operations are
 smaller than 1MB, mostly 512kb.  I know that there was some
 work done
 on optimising the kernel block device layer to process 1MB I/O
 requests and that those changes were committed to Lustre
 1.8.5. Thus I
 guess this I/O chopping happens below the Lustre stack, maybe
 in the
 mpt2sas driver?

 I am hoping that someone in Lustre community can shed some
 light on to
 my problem.

 In my setup I  use:
 Lustre 1.8.5
 CentOS-5.5

 Some parameters I tuned from defaults in CentOS:
 deadline I/O scheduler

 max_hw_sectors_kb=4096
 max_sectors_kb=1024




Re: [Lustre-discuss] lustre ofed compatibility

2011-06-09 Thread Kevin Van Maren
 to recompile the Linux kernel with an increased stack size,
because Lustre and OFED may use up the stack (both are stack greedy)
and thus lead to system hangs.


YiLei


On Thu, Jun 2, 2011 at 1:36 AM, Kevin Van Maren kevin.van.ma...@oracle.com 
 wrote:
OFED 1.5.1 should work fine with Lustre 1.8.4, although I believe  
more
people are using the in-kernel OFED now: Lustre (finally) defaulted  
to
the in-kernel OFED for RedHat, so it is no longer _necessary_ to  
build

either OFED or Lustre.

Kevin


Edward Walter wrote:
 Hi List,

 We're getting ready to upgrade the OS/software  stack on one of our
 clusters and I'm looking at which Lustre and OFED versions will  
work best.


 It looks like the changelog for 1.8.4 and the compatibility  
matrix have

 conflicting information.

 The Lustre compatibility matrix indicates that on Lustre 1.8.4; the
 highest OFED revision with o2iblnd support is 1.4.2:
 http://wiki.lustre.org/index.php/Lustre_Release_Information

 The changelog for 1.8.4 indicates that o2iblnd is supported with  
OFED 1.5.1:

 http://wiki.lustre.org/index.php/Change_Log_1.8#Changes_from_v1.8.3_to_v1.8.4


 Can someone clarify whether 1.8.4 supports o2iblnd with OFED  
1.5.1?  Are

 there any pitfalls to this configuration?  Has anyone found any
 instabilities with this configuration?

 Thanks much.

 -Ed Walter
 Carnegie Mellon University


Re: [Lustre-discuss] Pardon my stupidity: IOH?

2011-06-03 Thread Kevin Van Maren
The I/O Hub, which provides the PCI Express lanes to the processor.  See:
http://en.wikipedia.org/wiki/Intel_X58


Ms. Megan Larko wrote:
 Greetings,

 Please pardon my ignorance, what is this IOH to which the recent
 thread OSSes on dual IOH motherboards has been referring?

 Thanks,
 megan


Re: [Lustre-discuss] OSSes on dual IOH motherboards

2011-06-02 Thread Kevin Van Maren
Mark,

In addition to thread pinning, see also Bug 22078, which allows a 
different network interface to be used for different OSTs on the same 
server: a single IB interface is not enough to saturate one IOH, let 
alone multiple.

Normally all the threads are in a shared pool, where any thread can 
service any incoming request for any OST.

The most common server configuration is probably still dual-socket 
single IOH.

Kevin


Andreas Dilger wrote:
 Look for the Bull NUMIOA presentation from the recent LUG. The short story is 
 that OST thread pinning is critical to getting good performance.  The numbers 
 are something like 3.6GB/s without, and 6.0 GB/s with thread affinity. 

 Cheers, Andreas

 On 2011-06-02, at 7:23 PM, Mark Nelson m...@msi.umn.edu wrote:

   
 Hi List,

 I was wondering if anyone here has looked at the performance 
 characteristics of lustre OSSes on dual tylersburg motherboards with 
 raid controllers split up onto separate IO hubs.  I imagine that without 
 proper pinning of service threads to the right CPUs/IOH and memory pools 
 this could cause some nasty QPI contention.  Is this actually a problem 
 in practice?  Is it possible to pin service threads in a reasonable way 
 based on which OST is involved?  Anyone doing this on purpose to try and 
 gain more overall PCIE bandwidth?

 I imagine that in general it's probably best to stick with a single 
 socket single IOH OSS.  No pinning to worry about, very direct QPI 
 setup, consistent performance characteristics, etc.

 Thanks,
 Mark


Re: [Lustre-discuss] lustre ofed compatibility

2011-06-01 Thread Kevin Van Maren
OFED 1.5.1 should work fine with Lustre 1.8.4, although I believe more 
people are using the in-kernel OFED now: Lustre (finally) defaulted to 
the in-kernel OFED for RedHat, so it is no longer _necessary_ to build 
either OFED or Lustre.

Kevin


Edward Walter wrote:
 Hi List,

 We're getting ready to upgrade the OS/software  stack on one of our 
 clusters and I'm looking at which Lustre and OFED versions will work best.

 It looks like the changelog for 1.8.4 and the compatibility matrix have 
 conflicting information.

 The Lustre compatibility matrix indicates that on Lustre 1.8.4; the 
 highest OFED revision with o2iblnd support is 1.4.2:
 http://wiki.lustre.org/index.php/Lustre_Release_Information

 The changelog for 1.8.4 indicates that o2iblnd is supported with OFED 1.5.1:
 http://wiki.lustre.org/index.php/Change_Log_1.8#Changes_from_v1.8.3_to_v1.8.4


 Can someone clarify whether 1.8.4 supports o2iblnd with OFED 1.5.1?  Are 
 there any pitfalls to this configuration?  Has anyone found any 
 instabilities with this configuration?

 Thanks much.

 -Ed Walter
 Carnegie Mellon University


Re: [Lustre-discuss] mv_sata module for rhel5 and write through patch

2011-05-26 Thread Kevin Van Maren
Brock Palen wrote:
 We are (finally) updating our x4500's to rhel5 and lustre 1.8.5 from rhel4 
 and 1.6.7

 On rhel4 we had used the patch from:
 https://bugzilla.lustre.org/show_bug.cgi?id=14040

 for the mv_sata  module.

 Is this still recommended on rhel5? To use the mv_sata module over the stock 
 redhat sata_mv as well as applying this patch?  That patch is quite old; is 
 there a newer one?
   

I don't know: the last I heard was that the upcoming rhel 5.3 was to 
have an in-tree Marvell driver that worked.  If your system is still 
under support, I'd contact Oracle support for information about running 
RHEL5 on the x4500.

You do want to ensure the write-back cache is disabled on the drive, but 
you may be able to do that with udev scripts.  See Bug 17462 for an 
example for the J4400.
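As one possible approach (sdX is a placeholder, and whether hdparm or sdparm is the right tool depends on the drive/controller), the drive cache can be turned off by hand or from a udev rule so it survives reboots:

  # one-off: disable the on-drive write-back cache
  hdparm -W 0 /dev/sdX

  # or reapply automatically, e.g. in /etc/udev/rules.d/99-wcache.rules:
  # ACTION=="add", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W 0 /dev/%k"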

 What are other x4500/thumper users running?

 Also I will do some digging on the list but why is lustre 2.0 not the 
 'production' version? We are planning on 1.8.x for now but if 2.0 is stable 
 we would install that one.
   

Lustre 2.0 is not being widely used, and would not be covered by an 
Oracle support contract.  It is strongly recommended to run production 
systems on 1.8.x rather than 2.0.  If you really want to try Lustre 2.x, 
you will want to use something newer than 2.0: maybe check with 
lustre...@googlegroups.com for the current status of the whamcloud git 
repository?

 Can we upgrade directly from 1.6 to 2.0 if we did this?

 Brock Palen
 www.umich.edu/~brockp
 Center for Advanced Computing
 bro...@umich.edu
 (734)936-1985

   



Re: [Lustre-discuss] Checksums of files on disk

2011-05-25 Thread Kevin Van Maren
Christopher J.Walker wrote:
 The application I use, StoRM[1] can store checksums on disk in an
 extended user attribute - and use that to ensure the integrity of files
 on disk. The algorithm currently used is adler32. The intention is to
 perform end to end checksumming from file creation through storage,
 transfer over the WAN and storage at a site.

 Looking at
 http://wiki.lustre.org/manual/LustreManual20_HTML/ManagingFileSystemIO.html#50438211_pgfId-1291975

 I see that Lustre has some checksum support (though not for checksumming
 the file on the OST - so we'd still need to use the user attribute for
 that).


 http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html#50651264_pgfId-1291287


 Is the value of the checksum user accessible? Or to be more specific,
 I'd potentially get a big speedup if I were able to ask the diskserver
 to tell me the checksum of a file without actually transferring it over
 the network. Is it easy to do this?
   

No, the checksum is not currently available, and is not being stored on 
disk.  That being said, feel free to send patches!  There were some 
plans to merge the client-side checksum with the ZFS checksum when the 
backing store is ZFS, but I have not been following the ZFS status 
closely enough to know the status of that enhancement.  Do note that the 
Lustre checksums only cover the RPC, so at best each 1MB file chunk 
would have a separate checksum, generated on the client before doing the 
RPC (so not quite as end-to-end as an application checksum).
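If you go the application/xattr route on the client side, ordinary user xattrs do work on a Lustre mount (assuming user_xattr is enabled); a toy sketch, with an invented attribute name and value:

  setfattr -n user.checksum.adler32 -v 0x1a2b3c4d /mnt/lustre/somefile
  getfattr -n user.checksum.adler32 /mnt/lustre/somefile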

Also note that the checksums are not used when using mmap().  See Bug 
11742 for the details (it is sent, but failures are ignored).

Kevin

 Chris

 [1] http://storm.forge.cnaf.infn.it/home This is an SRM implementation
 we use to give an grid authentication to our storage (we store data for
 the LHC).


Re: [Lustre-discuss] [Lustre-community] Poor multithreaded I/O performance

2011-05-24 Thread Kevin Van Maren
[Moved to Lustre-discuss]


However, if I spawn 8 threads such that all of them write to the same 
file (non-overlapping locations), without explicitly synchronizing the 
writes (i.e. I dont lock the file handle)


How exactly does your multi-threaded application write the data?  Are 
you using pwrite to ensure non-overlapping regions or are they all just 
doing unlocked write() operations on the same fd to each write (each 
just transferring size/8)?  If it divides the file into N pieces, and 
each thread does pwrite on its piece, then what each OST sees are 
multiple streams at wide offsets to the same object, which could impact 
performance.

If on the other hand the file is written sequentially, where each thread 
grabs the next piece to be written (locking normally used for the 
current_offset value, so you know where each chunk is actually going), 
then you get a more sequential pattern at the OST.

If the number of threads maps to the number of OSTs (or some modulo, 
like in your case 6 OSTs per thread), and each thread owns the piece 
of the file that belongs to an OST (ie: for (offset = thread_num * 6MB; 
offset  size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then 
you've eliminated the need for application locks (assuming the use of 
pwrite) and ensured each OST object is being written sequentially.

It's quite possible there is some bottleneck on the shared fd.  So 
perhaps the question is not why you aren't scaling with more threads, 
but why the single file is not able to saturate the client, or why the 
file BW is not scaling with more OSTs.  It is somewhat common for 
multiple processes (on different nodes) to write non-overlapping regions 
of the same file; does performance improve if each thread opens its own 
file descriptor?

Kevin


Wojciech Turek wrote:
 Ok so it looks like you have in total 64 OSTs and your output file is 
 striped across 48 of them. May I suggest that you limit the number of 
 stripes; a good number to start with would be 8 stripes, and for best 
 results use the OST pools feature to arrange that each stripe goes to 
 an OST owned by a different OSS.
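 Roughly, that could look like the following (fsname, pool name, OST
 indexes and paths are made up; pool commands run against the MGS):

   # one pool containing a single OST from each OSS
   lctl pool_new testfs.perf
   lctl pool_add testfs.perf testfs-OST[0000-0007]
   # on a client: stripe the shared output file 8 wide over that pool
   lfs setstripe -p perf -c 8 -s 1M /mnt/lustre/output_file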

 regards,

 Wojciech

 On 23 May 2011 23:09, kme...@cs.uh.edu wrote:

 Actually, 'lfs check servers' returns 64 entries as well, so I
 presume the
 system documentation is out of date.

 Again, I am sorry the basic information had been incorrect.

 - Kshitij

  Run lfs getstripe your_output_file and paste the output of
 that command
  to
  the mailing list.
  Stripe count of 48 is not possible if you have max 11 OSTs (the
 max stripe
  count will be 11)
  If your striping is correct, the bottleneck can be your client
 network.
 
  regards,
 
  Wojciech
 
 
 
  On 23 May 2011 22:35, kme...@cs.uh.edu wrote:
 
  The stripe count is 48.
 
  Just fyi, this is what my application does:
  A simple I/O test where threads continually write blocks of size
  64Kbytes
  or 1Mbyte (decided at compile time) till a large file of say,
 16Gbytes
  is
  created.
 
  Thanks,
  Kshitij
 
   What is your stripe count on the file,  if your default is 1,
 you are
  only
   writing to one of the OST's.  you can check with the lfs
 getstripe
   command, you can set the stripe bigger, and hopefully your
  wide-stripped
   file with threaded writes will be faster.
  
   Evan
  
   -Original Message-
   From: lustre-community-boun...@lists.lustre.org
   [mailto:lustre-community-boun...@lists.lustre.org] On Behalf Of
   kme...@cs.uh.edu
   Sent: Monday, May 23, 2011 2:28 PM
   To: lustre-commun...@lists.lustre.org
   Subject: [Lustre-community] Poor multithreaded I/O performance
  
   Hello,
   I am running a multithreaded application that writes to a common
  shared
   file on lustre fs, and this is what I see:
  
   If I have a single thread in my application, I get a bandwidth of
  approx.
   250 MBytes/sec. (11 OSTs, 1MByte stripe size) However, if I
 spawn 8
   threads such that all of them write to the same file
 (non-overlapping
   locations), without explicitly synchronizing the writes (i.e.
 I dont
  lock
   the file handle), I still get the same bandwidth.
  
   Now, instead of writing to a shared file, if these threads
 write to
   separate files, the bandwidth obtained is approx. 700 Mbytes/sec.
  
   I would ideally like my multithreaded application to see similar
  scaling.
   Any ideas why the performance is limited and any workarounds?
  
   Thank you,
   Kshitij
  
  

Re: [Lustre-discuss] Two questions about the tuning of Lustre file system.

2011-05-20 Thread Kevin Van Maren
What exactly were you testing?  I have no idea how to interpret your 
numbers.  A single client reading from a single file?  One file per OST, 
or file striped across all OSTs?  Is the Lustre file system idle except 
for your test?

In general, start with the pieces:
1) make sure the network is sane.  Try measuring BW to/from each node 
(client and server) to ensure all the cables are good.  For your 
configuration, you should be able to measure ~3.2GB/s (unidirectional) 
using large MPI messages.  While I prefer to use MPI, some people use 
the lnet_selftest.
2) make sure each OST is sane.  For each OST, create a file that is only 
striped on that OST.  Make sure a client can read/write each of these 
files as expected.  Be sure you transfer much more data than the 
client+server RAM sizes.

Many issues are sorted out just getting both 1 & 2 in good shape.
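A rough sketch of both checks (NIDs, OST index, sizes and paths are placeholders; the lnet_selftest commands follow the manual, so double-check them against your version):

  # (1) raw LNET bandwidth between one client and one server
  #     (lnet_selftest must be loaded on every node involved)
  modprobe lnet_selftest
  export LST_SESSION=$$
  lst new_session bwtest
  lst add_group clients 192.168.1.100@o2ib
  lst add_group servers 192.168.1.10@o2ib
  lst add_batch bulk
  lst add_test --batch bulk --from clients --to servers brw write size=1M
  lst run bulk
  lst stat clients servers        # watch for a while, then interrupt
  lst stop bulk
  lst end_session

  # (2) one OST at a time: a file on a single OST, sized well past client+server RAM
  lfs setstripe -c 1 -i 3 /mnt/lustre/ost3_test
  dd if=/dev/zero of=/mnt/lustre/ost3_test bs=1M count=65536 oflag=direct
  dd if=/mnt/lustre/ost3_test of=/dev/null bs=1M iflag=direct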

Kevin



Tanin wrote:
 Dear all, 

 I have two questions regarding the performance of the Lustre system. 
 Currently, we have 5 OSS nodes, and each OSS carries 8 OST's. All the 
 nodes (including the MDT/MGS node and client node) are connected to a 
 Mellanox MTS 3600 InfiniBand switch using RDMA for data transfer. The 
 bandwidth of the network is 40Gbps. The kernel version is  'Linux 
 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 
 x86_64 x86_64 x86_64 GNU/Linux'. OS is RHEL 5.5.  Lustre version is 
 1.8.3. OFED Version is 1.5.2. IB HCA is Mellanox Technologies MT26428 
 ConnectX VPI PCIe IB QDR.

 And I did a simple test on the client side to see the peak data 
 reading performance. Here is the data:

 #time  Data transferred  Bandwidth
 2 sec   2.18 GBytes 8.71 Gbits/sec
 2 sec   2.06 GBytes 8.24 Gbits/sec
 2 sec   2.10 GBytes 8.40 Gbits/sec
 2 sec   1.93 GBytes 7.73 Gbits/sec
 2 sec   1.50 GBytes 6.02 Gbits/sec
 2 sec   420.00 MBytes   1.64 Gbits/sec
 2 sec   2.19 GBytes 8.75 Gbits/sec
 2 sec   2.08 GBytes 8.32 Gbits/sec
 2 sec   2.08 GBytes 8.32 Gbits/sec
 2 sec   1.99 GBytes 7.97 Gbits/sec
 2 sec   1.80 GBytes 7.19 Gbits/sec
 *2 sec   160.00 MBytes   640.00 Mbits/sec*
 2 sec   2.15 GBytes 8.59 Gbits/sec
 2 sec   2.13 GBytes 8.52 Gbits/sec
 2 sec   2.15 GBytes 8.59 Gbits/sec
 2 sec   2.09 GBytes 8.36 Gbits/sec
 2 sec   2.09 GBytes 8.36 Gbits/sec
 2 sec   2.07 GBytes 8.28 Gbits/sec
 2 sec   2.15 GBytes 8.59 Gbits/sec
 2 sec   2.11 GBytes 8.44 Gbits/sec
 2 sec   2.05 GBytes 8.20 Gbits/sec
 *2 sec   0.00 Bytes  0.00 bits/sec*
 *2 sec   0.00 Bytes  0.00 bits/sec*
 2 sec   1.95 GBytes 7.81 Gbits/sec
 2 sec   2.14 GBytes 8.55 Gbits/sec
 2 sec   1.99 GBytes 7.97 Gbits/sec
 2 sec   2.00 GBytes 8.01 Gbits/sec
 2 sec   370.00 MBytes   1.45 Gbits/sec
 2 sec   1.96 GBytes 7.85 Gbits/sec
 2 sec   2.03 GBytes 8.12 Gbits/sec
 2 sec   1.89 GBytes 7.58 Gbits/sec
 2 sec   1.94 GBytes 7.77 Gbits/sec
 2 sec   640.00 MBytes   2.50 Gbits/sec
 2 sec   1.47 GBytes 5.90 Gbits/sec
 2 sec   1.94 GBytes 7.77 Gbits/sec
 2 sec   1.90 GBytes 7.62 Gbits/sec
 2 sec   1.94 GBytes 7.77 Gbits/sec
 2 sec   1.18 GBytes 4.73 Gbits/sec
 2 sec   940.00 MBytes   3.67 Gbits/sec
 2 sec   1.97 GBytes 7.89 Gbits/sec
 2 sec   1.93 GBytes 7.73 Gbits/sec
 2 sec   1.87 GBytes 7.46 Gbits/sec
 2 sec   1.77 GBytes 7.07 Gbits/sec
 2 sec   320.00 MBytes   1.25 Gbits/sec
 2 sec   1.97 GBytes 7.89 Gbits/sec
 2 sec   2.00 GBytes 8.01 Gbits/sec
 2 sec   1.89 GBytes 7.58 Gbits/sec
 2 sec   1.93 GBytes 7.73 Gbits/sec
 2 sec   350.00 MBytes   1.37 Gbits/sec
 2 sec   1.77 GBytes 7.07 Gbits/sec
 2 sec   1.92 GBytes 7.70 Gbits/sec
 2 sec   2.05 GBytes 8.20 Gbits/sec
 2 sec   2.01 GBytes 8.05 Gbits/sec
 2 sec   710.00 MBytes   2.77 Gbits/sec
 2 sec   1.59 GBytes 6.37 Gbits/sec
 2 sec   2.00 GBytes 8.01 Gbits/sec
 2 sec   710.00 MBytes   2.77 Gbits/sec
 2 sec   1.59 GBytes 6.37 Gbits/sec
 2 sec   2.00 GBytes 8.01 Gbits/sec
 2 sec   1.88 GBytes 7.54 Gbits/sec
 2 sec   1.62 GBytes 6.48 Gbits/sec


 As you can see, although the peak bandwidth can reach 8.71Gbps, the 
 performance is quite unstable(sometimes the bandwidth just gets 
 chocked). All the OSS node seems to stop reading data simultaneously. 
 I tried to group up different OSTs and turn on/off the checksum, this 
 still happens. Does anybody get a hint of the reason?

 2. As we know, when reading data from lustre client, the data is moved 
 from 

Re: [Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?

2011-05-19 Thread Kevin Van Maren
Dardo D Kleiner - CONTRACTOR wrote:
 Short answer: of course it works - they're just block devices after all - but 
 you'll find that you won't realize the performance gains you might expect (at 
 least not for an MDT).
   

Yes.  See the email thread improving metadata performance and Robin 
Humble's talk at LUG.  The MDT disk is rarely the bottleneck (although 
that could change with full size-on-mds support), which others had 
discovered using a ram-based (tmpfs) MDT.

As for putting the entire filesystem on flash, sure that would be pretty 
nifty, but expensive.  Not being able to do failover, with storage on 
internal PCIe cards, is a downside.

 Aside from simply being fast OSTs, there are several areas that would allow 
 Lustre to take advantage of these kinds of devices:

 1) SMP scaling for the MDS - the problem right now is that the low latency of 
 these devices really shines best when you have many threads scattering small 
 I/O.  The current (1.8.x) Lustre MDS doesn't 
 do this.
   
SMP scaling is a big issue.  In Lustre 1.8.x the maximum performance is 
reached with no more than 8 CPUs (maybe fewer) for the MDT -- additional cpu cores 
result in _lower_ performance.  There are patches for Lustre 2.x to 
improve SMP scaling, but I haven't tested such a workload.

 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be 
 done today, of course.  There's some interop issues in my testing, but when 
 it works it does what it says it does.  It 
 still won't really help an MDT though.
 3) Targeted device mapping of the metadata portions of an OST on traditional 
 disk (e.g. extent lists) onto flash.

 #1 is substantial work (ongoing I believe).  #2 is pretty nifty, basically 
 grow your local page cache beyond RAM - helps when hot working set is 
 large.  #3 is trickier and though I haven't tried it 
 I understand there's real effort ongoing in this regard.
   

flex_bg is in ext4, which allows the inodes to be packed together.

 Filesystem size in this discussion is mostly irrelevant for an MDT, it's just 
 whether or not the device is big enough for the number of objects (a few 
 million is *not* many).  A huge number of clients 
 thrashing about creating/modifying/deleting is where these things have the 
 most potential.

 - Dardo

 On 5/16/11 2:58 PM, Carlson, Timothy S wrote:
   
 Folks,

 I know that flash based technology gets talked about from time to time on 
 the list, but I was wondering if anybody has actually implemented FusionIO 
 devices for metadata. The last thread I can find on the mailing list that 
 relates to this topic dates from 3 years ago. The software driving the 
 Fusion cards has come quite a ways since then and I've got good experience 
 using the device as a raw disk. I'm just fishing around to see if anybody 
 has implemented one of these devices in a reasonably sized Lustre config 
 where reasonably is left open to interpretation. I'm thinking 500T and a 
 few million files.

 Thanks!

 Tim

 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] software RAID1 in RHEL5

2011-05-19 Thread Kevin Van Maren
Adesanya, Adeyemi wrote:
 I'm discussing the proposed architecture for two new Lustre 1.8.x 
 filesystems. We plan to use a failover pair of MDS nodes (active-active), 
 with each MDS serving an MDT.  The MDTs will be housed in external storage 
 but we would like to implement redundancy across more than one storage array 
 by using software RAID1.

 The Lustre documentation mentions using linux md to set up software RAID1 or 
 RAID10 for MDTs. Does the RAID1 implementation in the Lustre 1.8.x RHEL5 
 kernel do an adequate job of ensuring consistency across mirrored devices 
 (compared to a hardware RAID1 implementation)?
   

Adequate, probably.  As correct as hardware raid, doubtful.  Without 
special hardware, or doing things that kill performance, there will 
always remain some corner cases.

The issue is what happens for writes that are in process when you have a 
crash/reboot/power loss: it is possible for them to make it to one disk, 
but not the other.  So it is possible to believe they are on disk, and 
proceed accordingly, when they are only on one copy, and are lost if 
that disk fails.  Even worse, Linux alternates reads, so in theory it 
could be there one time and gone the next.

The good news is that writes should(!) not be marked as on disk until 
both disks have said it is written.  So you could do an md 'check', and 
if needed do a 'repair', before e.g. replaying the journal (mounting the 
file system, doing fsck, etc).  Even if the MD resync takes the older 
copy and undoes a write, it should not have been a write that was 
expected to have made it to stable storage, so the normal Lustre 
recovery mechanisms should be able to replay it.  Assuming, that is, 
that this is done _before_ you mount the device.
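
As a rough sketch of that sequence, assuming the MDT mirror is /dev/md0 and 
/mnt/mdt is its mount point (both placeholders):

    # Ask md to verify the mirror and report mismatches:
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                      # wait for the check to complete
    cat /sys/block/md0/md/mismatch_cnt    # non-zero means the copies differ
    # If they differ, resynchronize the copies:
    echo repair > /sys/block/md0/md/sync_action
    # Only after that, mount the MDT so Lustre recovery sees a consistent device:
    mount -t lustre /dev/md0 /mnt/mdt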

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Kevin Van Maren
See bug 24264 -- certainly possible that the raid controller corrupted  
your filesystem.

If you remove the new drive and reboot, does the file system look  
cleaner?

Kevin


On May 13, 2011, at 11:39 AM, Zachary Beebleson zbee...@math.uchicago.edu 
  wrote:


 We recently had two raid rebuilds on a couple storage targets that  
 did not go
 according to plan. The cards reported a successful rebuild in each  
 case, but
 ldiskfs errors started showing up on the associated OSSs and the  
 affected OSTs
 were  remounted read-only. We are planning to migrate off the data,  
 but we've
 noticed that some clients are getting i/o errors, while others are  
 not. As an
 example, a file that has a stripe on at least one affected OST could  
 not be
 read on one client, i.e. I received a read-error trying to access  
 it, while it
 was perfectly readable and apparently uncorrupted on another (I am  
 able to
 migrate the file to healthy OSTs by copying to a new file name). The  
 clients
 with the i/o problem see inactive devices corresponding to the read- 
 only OSTs
 when I issue a 'lfs df', while the others without the i/o problems  
 report the
 targets as normal. Is it just that many clients are not aware of an  
 OST problem
 yet? I need clients with minimal I/O disruptions in order to migrate  
 as much
 data off as possible.

 A client reboot appears to awaken them to the fact that there are  
 problems with
 the OSTs. However, I need them to be able to read the data in order  
 to migrate
 it off. Is there a way to reconnect the clients to the problematic  
 OSTs?

 We have dd-ed copies of the OSTs to try e2fsck against them, but the  
 results
 were not promising. The check aborted with:

 --
 Resize inode (re)creation failed: A block group is missing an inode
 table.Continue? yes

 ext2fs_read_inode: A block group is missing an inode table while  
 reading inode
 7 in recreate inode
 e2fsck: aborted
 --

 Any advice would be greatly appreciated.
 Zach
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] two iSCSI lun for OST conf in RAID 1

2011-05-13 Thread Kevin Van Maren
You could use software RAID below Lustre to present an md device.  See 
the mdadm command.
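
As a rough sketch, assuming the two iSCSI LUNs appear as /dev/sdb and /dev/sdc, 
with a placeholder fsname and MGS NID:

    # Build a RAID1 md device across the two storages:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    # Then format and mount the md device as an OST as usual:
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp /dev/md0
    mount -t lustre /dev/md0 /mnt/ost0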

Kevin


Roberto Scudeller wrote:
 Hi all,

 I need help. Is possible config 2 lun (of the 2 different storages) 
 for OST in RAID1?

 I need the same data replicated in 2 storages for data recovery 
 (security and et.).

 Cheers,

 -- 
 Roberto Scudeller

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client question

2011-05-13 Thread Kevin Van Maren
It sounds like it is working better.  Did the clients recover?  I would 
have re-run fsck before mounting it again, and moving the data off may 
still be the best plan.  Since dropping the rebuilt drive reduced the 
corruption, certainly contact your raid vendor over this issue.

Kevin


Zachary Beebleson wrote:
 Kevin,

 I just failed the drive and remounted. A basic 'df' hangs when it gets to
 the mount point, but /proc/fs/lustre/health_check reports everything is
 healthy. 'lfs df' on a client reports the OST is active, where it was
 inactive before. However, now I'm working with a degraded volume, but it
 is raid 6. Should I try another rebuild or just proceed with the
 migration off of this OST asap?

 Thanks,
 Zach

 PS. Sorry for the repeat message
 On Fri, 13 May 2011, Kevin Van Maren wrote:

 See bug 24264 -- certainly possible that the raid controller 
 corrupted your filesystem.

 If you remove the new drive and reboot, does the file system look 
 cleaner?

 Kevin


 On May 13, 2011, at 11:39 AM, Zachary Beebleson 
 zbee...@math.uchicago.edu wrote:


 We recently had two raid rebuilds on a couple storage targets that 
 did not go
 according to plan. The cards reported a successful rebuild in each 
 case, but
 ldiskfs errors started showing up on the associated OSSs and the 
 affected OSTs
 were  remounted read-only. We are planning to migrate off the data, 
 but we've
 noticed that some clients are getting i/o errors, while others are 
 not. As an
 example, a file that has a stripe on at least one affected OST could 
 not be
 read on one client, i.e. I received a read-error trying to access 
 it, while it
 was perfectly readable and apparently uncorrupted on another (I am 
 able to
 migrate the file to healthy OSTs by copying to a new file name). The 
 clients
 with the i/o problem see inactive devices corresponding to the 
 read-only OSTs
 when I issue a 'lfs df', while the others without the i/o problems 
 report the
 targets as normal. Is it just that many clients are not aware of an 
 OST problem
 yet? I need clients with minimal I/O disruptions in order to migrate 
 as much
 data off as possible.

 A client reboot appears to awaken them to the fact that there are 
 problems with
 the OSTs. However, I need them to be able to read the data in order 
 to migrate
 it off. Is there a way to reconnect the clients to the problematic 
 OSTs?

 We have dd-ed copies of the OSTs to try e2fsck against them, but the 
 results
 were not promising. The check aborted with:

 --
 Resize inode (re)creation failed: A block group is missing an inode
 table.Continue? yes

 ext2fs_read_inode: A block group is missing an inode table while 
 reading inode
 7 in recreate inode
 e2fsck: aborted
 --

 Any advice would be greatly appreciated.
 Zach
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fragmented I/O

2011-05-12 Thread Kevin Van Maren
Kevin Hildebrand wrote:

 The PERC 6 and H800 use megaraid_sas, I'm currently running 
 00.00.04.17-RH1.

 The max_sectors numbers (320) are what is being set by default- I am 
 able to set it to something smaller than 320, but not larger.

Right.  You can not set max_sectors_kb larger than max_hw_sectors_kb 
(Linux normally defaults most drivers to 512, but Lustre sets them to be 
the same): you may want to instrument your HBA driver to see what is 
going on (ie, why the max_hw_sectors_kb is < 1024).  I don't know if it 
is due to a driver limitation or a true hardware limit.

Most drivers have a limit of 512KB by default; see Bug 22850 for the 
patches that fixed the QLogic and Emulex fibre channel drivers.
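
As a quick check, the limits in question can be read per device, and 
max_sectors_kb can be lowered but never raised above the hardware value 
(sdb/sdc are placeholders for your OST devices):

    for dev in sdb sdc; do
        echo -n "$dev hw_max_kb=" ; cat /sys/block/$dev/queue/max_hw_sectors_kb
        echo -n "$dev cur_kb="    ; cat /sys/block/$dev/queue/max_sectors_kb
    done
    # Lowering is allowed, e.g. to match a 256KB chunk size:
    echo 256 > /sys/block/sdb/queue/max_sectors_kb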

Kevin

 Kevin

 On Wed, 11 May 2011, Kevin Van Maren wrote:

 You didn't say, but I think they are LSI-based: are you using the mptsas
 driver with the PERC cards?  Which driver version?

 First, max_sectors_kb should normally be set to a power of 2 number,
 like 256, over an odd size like 320.  This number should also match the
 native raid size of the device, to avoid read-modify-write cycles.  (See
 Bug 22886 on why not to make it > 1024 in general).

 See Bug 17086 for patches to increase the max_sectors_kb limitation for
 the mptsas driver to 1MB, or the true hardware maximum, rather than a
 driver limit; however, the hardware may still be limited to sizes < 1MB.

 Also, to clarify the sizes: the smallest bucket >= transfer_size is the
 one incremented, so a 320KB IO increments the 512KB bucket.  Since your
 HW says it can only do a 320KB IO, there will never be a 1MB IO.

 You may want to instrument your HBA driver to see what is going on (ie,
 why the max_hw_sectors_kb is < 1024).

 Kevin


 Kevin Hildebrand wrote:
 Hi, I'm having some performance issues on my Lustre filesystem and it
 looks to me like it's related to I/Os getting fragmented before being
 written to disk, but I can't figure out why.  This system is RHEL5,
 running Lustre 1.8.4.

 All of my OSTs look pretty much the same-

 read  | write
 pages per bulk r/w rpcs  % cum % |  rpcs  % cum %
 1:   88811  38  38   | 46375  17  17
 2:1497   0  38   | 7733   2  20
 4:1161   0  39   | 1840   0  21
 8:1168   0  39   | 7148   2  24
 16:922   0  40   | 3297   1  25
 32:979   0  40   | 7602   2  28
 64:   1576   0  41   | 9046   3  31
 128:  7063   3  44   | 16284   6  37
 256:129282  55 100   | 162090  62 100


 read  | write
 disk fragmented I/Os   ios   % cum % |  ios   % cum %
 0:   51181  22  22   |0   0   0
 1:   45280  19  42   | 82206  31  31
 2:   16615   7  49   | 29108  11  42
 3:3425   1  50   | 17392   6  49
 4:  110445  48  98   | 129481  49  98
 5:1661   0  99   | 2702   1  99

 read  | write
 disk I/O size  ios   % cum % |  ios   % cum %
 4K:  45889   8   8   | 56240   7   7
 8K:   3658   0   8   | 6416   0   8
 16K:  7956   1  10   | 4703   0   9
 32K:  4527   0  11   | 11951   1  10
 64K:114369  20  31   | 134128  18  29
 128K: 5095   0  32   | 17229   2  31
 256K: 7164   1  33   | 30826   4  35
 512K:   369512  66 100   | 465719  64 100

 Oddly, there's no 1024K row in the I/O size table...


 ...and these seem small to me as well, but I can't seem to change them.
 Writing new values to either doesn't change anything.

 # cat /sys/block/sdb/queue/max_hw_sectors_kb
 320
 # cat /sys/block/sdb/queue/max_sectors_kb
 320

 Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID
 controllers, with MD1000 and MD1200 arrays, respectively.


 Any clues on where I should look next?

 Thanks,

 Kevin

 Kevin Hildebrand
 University of Maryland, College Park
 Office of Information Technology
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre 1.8.4 - Local mount of ost for backup purposes, fs type ldiskfs or ext4?

2011-05-11 Thread Kevin Van Maren
Well, that's the opposite problem of Bug 24398.

Are you sure you are using the ext4-based ldiskfs?

Kevin


On May 11, 2011, at 4:23 PM, Jeff Johnson jeff.john...@aeoncomputing.com 
  wrote:

 Greetings,

 I am doing a local mount of a 8TB ost device in a Lustre 1.8.4
 installation. The ost was built with a backfstype of ldiskfs.

 When attempting the local mount:

mount -t ldiskfs /dev/sdc /mnt/save/ost

 I get:

mount: wrong fs type, bad option, bad superblock on /dev/sdt,
missing codepage or other error

 I am able to mount the same block device as ext4, just not as  
 ldiskfs. I
 need to be able to mount as ldiskfs to get access to the extended
 attributes and back them up. Is this still the case with the ext4
 extensions for Lustre 1.8.4? I am able to mount read-only as ext4 but
 any attempt at reading the extended attributes with getfattr fails.

 Thanks,

 --Jeff

 -- 
 --
 Jeff Johnson
 Manager
 Aeon Computing

 jeff.john...@aeoncomputing.com
 www.aeoncomputing.com
 t: 858-412-3810 x101   f: 858-412-3845
 m: 619-204-9061

 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Fragmented I/O

2011-05-11 Thread Kevin Van Maren
You didn't say, but I think they are LSI-based: are you using the mptsas 
driver with the PERC cards?  Which driver version?

First, max_sectors_kb should normally be set to a power of 2 number, 
like 256, over an odd size like 320.  This number should also match the 
native raid size of the device, to avoid read-modify-write cycles.  (See 
Bug 22886 on why not to make it > 1024 in general).

See Bug 17086 for patches to increase the max_sectors_kb limitation for 
the mptsas driver to 1MB, or the true hardware maximum, rather than a 
driver limit; however, the hardware may still be limited to sizes < 1MB.

Also, to clarify the sizes: the smallest bucket >= transfer_size is the 
one incremented, so a 320KB IO increments the 512KB bucket.  Since your 
HW says it can only do a 320KB IO, there will never be a 1MB IO.

You may want to instrument your HBA driver to see what is going on (ie, 
why the max_hw_sectors_kb is < 1024).

Kevin


Kevin Hildebrand wrote:
 Hi, I'm having some performance issues on my Lustre filesystem and it 
 looks to me like it's related to I/Os getting fragmented before being 
 written to disk, but I can't figure out why.  This system is RHEL5, 
 running Lustre 1.8.4.

 All of my OSTs look pretty much the same-

 read  | write
 pages per bulk r/w rpcs  % cum % |  rpcs  % cum %
 1:   88811  38  38   | 46375  17  17
 2:1497   0  38   | 7733   2  20
 4:1161   0  39   | 1840   0  21
 8:1168   0  39   | 7148   2  24
 16:922   0  40   | 3297   1  25
 32:979   0  40   | 7602   2  28
 64:   1576   0  41   | 9046   3  31
 128:  7063   3  44   | 16284   6  37
 256:129282  55 100   | 162090  62 100


 read  | write
 disk fragmented I/Os   ios   % cum % |  ios   % cum %
 0:   51181  22  22   |0   0   0
 1:   45280  19  42   | 82206  31  31
 2:   16615   7  49   | 29108  11  42
 3:3425   1  50   | 17392   6  49
 4:  110445  48  98   | 129481  49  98
 5:1661   0  99   | 2702   1  99

 read  | write
 disk I/O size  ios   % cum % |  ios   % cum %
 4K:  45889   8   8   | 56240   7   7
 8K:   3658   0   8   | 6416   0   8
 16K:  7956   1  10   | 4703   0   9
 32K:  4527   0  11   | 11951   1  10
 64K:114369  20  31   | 134128  18  29
 128K: 5095   0  32   | 17229   2  31
 256K: 7164   1  33   | 30826   4  35
 512K:   369512  66 100   | 465719  64 100

 Oddly, there's no 1024K row in the I/O size table...


 ...and these seem small to me as well, but I can't seem to change them. 
 Writing new values to either doesn't change anything.

 # cat /sys/block/sdb/queue/max_hw_sectors_kb
 320
 # cat /sys/block/sdb/queue/max_sectors_kb
 320

 Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID 
 controllers, with MD1000 and MD1200 arrays, respectively.


 Any clues on where I should look next?

 Thanks,

 Kevin

 Kevin Hildebrand
 University of Maryland, College Park
 Office of Information Technology
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre filesystem hangs when reading large files

2011-04-20 Thread Kevin Van Maren
Chris Exton wrote:

 Hello,

 We are currently using lustre 1.8.1.1 and using kernel version 
 2.6.18_128.7.1.el5_lustre.

 We are experiencing problems when performing reads of large files from 
 my lustre filesystem, small reads are not affected.

 The read process hangs and the following message is reported in 
 /var/log/messages:

 Feb 22 15:59:38 leopard kernel: LustreError: 11-0: an error occurred 
 while communicating with 192.168.13.200@o2ib. The obd_ping operation 
 failed with -107

 Feb 22 15:59:38 leopard kernel: Lustre: 
 lustre-OST-osc-81067e0eac00: Connection to service 
 lustre-OST via nid 192.168.13.200@o2ib was lost; in progress 
 operations using this service will wait for recovery to complete.

 Feb 22 15:59:38 leopard kernel: LustreError: 
 6811:0:(import.c:939:ptlrpc_connect_interpret()) lustre-OST_UUID 
 went back in time (transno 476754140074 was previously committed, 
 server now claims 0)! See 
 https://bugzilla.lustre.org/show_bug.cgi?id=9646

 Feb 22 15:59:38 leopard kernel: LustreError: 167-0: This client was 
 evicted by lustre-OST; in progress operations using this service 
 will fail.

 Feb 22 15:59:38 leopard kernel: Lustre: 
 lustre-OST-osc-81067e0eac00: Connection restored to service 
 lustre-OST using nid 192.168.13.200@o2ib.

 Feb 22 15:59:38 leopard kernel: LustreError: 
 17592:0:(lov_request.c:196:lov_update_enqueue_set()) enqueue objid 
 0x18f87222 subobj 0x4d0c9f on OST idx 0: rc -5

 I have checked the bugzilla report but we have not had a disk crash 
 and the system was not restarted. Could this be an underlying hardware 
 problem that’s not getting logged?


Could be a hardware issue with your network, but not your disk: it looks 
like a network failure resulted in client eviction (server unable to 
contact client, so it was evicted), which resulted in the back in time 
message when it reconnected (and could not complete outstanding IOs -- 
pending writes, ie from client cache, get dropped on the floor when 
evicted). See https://bugzilla.lustre.org/show_bug.cgi?id=21681

 Any additional help on this matter would be much appreciated.

 Kind Regards

 Chris


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] poor ost write performance.

2011-04-20 Thread Kevin Van Maren
First guess is the increased memory pressure caused by the Lustre 1.8  
read cache.  Many times slow messages are caused by memory  
allocations taking a long time.

You could try disabling the read cache and see if that clears up the  
slow messages.
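
As a rough sketch (the obdfilter tunable names below are the 1.8 ones; verify 
them on your OSS with lctl list_param):

    # Disable the server-side read and writethrough caches on all OSTs:
    lctl set_param obdfilter.*.read_cache_enable=0
    lctl set_param obdfilter.*.writethrough_cache_enable=0
    # Verify the current settings:
    lctl get_param obdfilter.*.read_cache_enable obdfilter.*.writethrough_cache_enable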

Kevin


On Apr 20, 2011, at 4:29 AM, James Rose james.r...@framestore.com  
wrote:

 Hi

 We have been experiencing degraded performance for a few days on a  
 fresh install of lustre 1.8.5 (on RHEL5 using sun ext4 rpms).  The  
 initial bulk load of the data will be fine but once in use for a  
 while writes become very slow to individual ost.  This will block io  
 for a few minutes and then carry on as normal.  The slow writes will  
 then move to another ost.  This can be seen in iostat and many slow  
 IO messages will be seen in the logs (example included)

 The osts are between 87 and 90% full.  Not ideal but has not caused any  
 issues running 1.6.7.2 on the same hardware.

 The osts are RAID6 on external raid chassis (Infortrend).  Each ost  
 is 5.4T (small).  The server is Dual AMD (4 cores). 16G Ram. Qlogic  
 FC HBA.

 I mounted the osts as ldiskfs and tried a few write tests.  These  
 also show the same behaviour.

 While the write operation is blocked there will be hundreds of read  
 tps and a very small kb/s read from the raid but no writes.  As  
 soon as this completes writes will go through at a more expected  
 speed.

 Any idea what is going on?

 Many thanks

 James.

 Example error messages:

 Apr 20 04:53:04 oss5r-mgmt kernel: LustreError: dumping log to /tmp/ 
 lustre-log.1303271584.3935
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow quota  
 init 286s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 39s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 39 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 brw_start 39s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 38 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 133s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 brw_start 133s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 236s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex  
 40s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 2 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 6 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex  
 277s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 direct_io 286s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 3 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 285s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous  
 similar message
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 commitrw commit 285s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous  
 similar message
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow parent  
 lock 236s due to heavy IO load


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] poor ost write performance.

2011-04-20 Thread Kevin Van Maren
Yes, difficulty finding free disk space can also be a problem, but I 
could not recall big changes in
how that worked since 1.6, other than memory pressure from the read 
cache pushing out the bitmaps.
See http://jira.whamcloud.com/browse/LU-15

Kevin


James Rose wrote:
 Hi Kevin,

 Thanks for the suggestion.  I will try this out.  

 For the moment it seems that it may be disk space related.  I have
 removed some data from the file system.  Performance returned to where I
 would expect it to be as space freed up (currently at 83% full).  Since
 freeing space I have seen two messages on an OSS where the number of
 threads is tuned to the amount of RAM in the host, and six on an OSS that
 has the number of threads set higher than it should be.  This is a much
 better situation than the steady stream I was experiencing last night.
 Maybe disabling the read cache will remove the last few.

 I am still very curious what the rapid small reads seen when writing are,
 as this showed up while mounted as ldiskfs, so not doing regular Lustre
 operations at all.

 Thanks again for your help,

 James.


 On Wed, 2011-04-20 at 08:48 -0300, Kevin Van Maren wrote:
   
 First guess is the increased memory pressure caused by the Lustre 1.8  
 read cache.  Many times slow messages are caused by memory  
 allocations taking a long time.

 You could try disabling the read cache and see if that clears up the  
 slow messages.

 Kevin


 On Apr 20, 2011, at 4:29 AM, James Rose james.r...@framestore.com  
 wrote:

 
 Hi

 We have been experiencing degraded performance for a few days on a  
 fresh install of lustre 1.8.5 (on RHEL5 using sun ext4 rpms).  The  
 initial bulk load of the data will be fine but once in use for a  
 while writes become very slow to individual ost.  This will block io  
 for a few minutes and then carry on as normal.  The slow writes will  
 then move to another ost.  This can be seen in iostat and many slow  
 IO messages will be seen in the logs (example included)

 The osts are between 87 and 90% full.  Not ideal but has not caused any  
 issues running 1.6.7.2 on the same hardware.

 The osts are RAID6 on external raid chassis (Infortrend).  Each ost  
 is 5.4T (small).  The server is Dual AMD (4 cores). 16G Ram. Qlogic  
 FC HBA.

 I mounted the osts as ldiskfs and tried a few write tests.  These  
 also show the same behaviour.

 While the write operation is blocked there will be hundreds of read  
 tps and a very small kb/s read from the raid but no writes.  As  
 soon as this completes writes will go through at a more expected  
 speed.

 Any idea what is going on?

 Many thanks

 James.

 Example error messages:

 Apr 20 04:53:04 oss5r-mgmt kernel: LustreError: dumping log to /tmp/ 
 lustre-log.1303271584.3935
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow quota  
 init 286s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 39s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 39 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 brw_start 39s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 38 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 133s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 brw_start 133s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 236s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex  
 40s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 2 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 6 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex  
 277s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 direct_io 286s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 3 previous  
 similar messages
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal  
 start 285s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous  
 similar message
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow  
 commitrw commit 285s due to heavy IO load
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous  
 similar message
 Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow parent  
 lock 236s due to heavy IO load


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 ___
 Lustre-discuss mailing list

Re: [Lustre-discuss] Optimal strategy for OST distribution

2011-03-31 Thread Kevin Van Maren
It used to be that multi-stripe files were created with sequential OST 
indexes.  It also used to be that OST indexes were sequentially assigned 
to newly-created files.
As Lustre now adds greater randomization, the strategy for assigning 
OSTs to OSS nodes (and storage hardware, which often limits the 
aggregate performance of multiple OSTs) is less important.

While I have normally gone with -a-, -b- can make it easier to remember 
where OSTs are located, and also to keep a uniform convention if the 
storage system is later grown.

Kevin


Heckes, Frank wrote:
 Hi all,

 sorry if this question has been answered before.

 What is the optimal 'strategy' assigning OSTs to OSS nodes:

 -a- Assign OST via round-robin to the OSS
 -b- Assign in consecutive order (as long as the backend storage provides
 enought capacity for iops and bandwidth)
 -c- Something 'in-between' the 'extremes' of -a- and -b-

 E.g.:

 -a- OSS_1   OSS_2   OSS_3
   |_  |_  |_
 OST_1   OST_2   OST_3
 OST_4   OST_5   OST_6
 OST_7   OST_8   OST_9

 -b- OSS_1   OSS_2   OSS_3
   |_  |_  |_
 OST_1   OST_4   OST_7
 OST_2   OST_5   OST_8
 OST_3   OST_6   OST_9

 I thought -a- would be best for task-local (each task writes to its own
 file) and single-file (all tasks write to a single file) I/O, since it's like
 a raid-0 approach used for disk I/O (and SUN created our first FS this way).
 Has someone made any systematic investigation into which approach is best,
 or does someone have an educated opinion?
 Many thanks in advance.
 BR

 -Frank Heckes

 
 
 Forschungszentrum Juelich GmbH
 52425 Juelich
 Sitz der Gesellschaft: Juelich
 Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
 Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
 Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
 Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
 Prof. Dr. Sebastian M. Schmidt
 
 

 Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Problem with lustre 2.0.0.1, ext3/4 and big OSTs (8Tb)

2011-03-15 Thread Kevin Van Maren
Joan J. Piles wrote:
 Hi,

 We are trying to set up a lustre 2.0.0.1 (the most recent one 
 downloadable from the official site) installation. We plan to have some 
 big OSTs (~ 12Tb), using ScientificLinux 5.5 (which should be a RHEL 
 clone for all purposes).

 However, when we try to format the OSTs, we get the following error:

   
 [root@oss01 ~]# mkfs.lustre --ost --fsname=extra 
 --mgsnode=172.16.4.4@tcp0 --mkfsoptions '-i 262144 -E 
 stride=32,stripe_width=192 ' /dev/sde

Permanent disk data:
 Target: extra-OST
 Index:  unassigned
 Lustre FS:  extra
 Mount type: ldiskfs
 Flags:  0x72
   (OST needs_index first_time update )
 Persistent mount opts: errors=remount-ro,extents,mballoc
 Parameters: mgsnode=172.16.4.4@tcp

 checking for existing Lustre data: not found
 device size = 11427830MB
 formatting backing filesystem ldiskfs on /dev/sde
 target name  extra-OST
 4k blocks 2925524480
 options   -i 262144 -E stride=32,stripe_width=192  -J size=400 
 -I 256 -q -O dir_index,extents,uninit_bg -F
 mkfs_cmd = mke2fs -j -b 4096 -L extra-OST -i 262144 -E 
 stride=32,stripe_width=192  -J size=400 -I 256 -q -O 
 dir_index,extents,uninit_bg -F /dev/sde 2925524480
 mkfs.lustre: Unable to mount /dev/sde: Invalid argument

 mkfs.lustre FATAL: failed to write local files
 mkfs.lustre: exiting with 22 (Invalid argument)
 


 In the dmesg log, we find the following line:

   
 LDISKFS-fs does not support filesystems greater than 8TB and can cause 
 data corruption.Use force_over_8tb mount option to override.
 

 After some investigation, we find it is related to the use of ext3 
 instead of ext4, 

Correct.

 even though we should be using ext4, proven by the fact 
 that the file systems created are actually ext4:

   
 [root@oss01 ~]# file -s /dev/sde
 /dev/sde: Linux rev 1.0 ext4 filesystem data (extents) (large files)
 

No, these are ldiskfs filesystems.  ext3+ldiskfs looks a bit like ext4 
(ext4 is largely based on the
enhancements done for Lustre's ldiskfs), but is not the same as 
ext4+ldiskfs.  In particular, file system
size is limited to 8TB, not 16TB.

 Further, we made a test with an ext3 filesystem in the same machine, and 
 the difference is found:

   
 [root@oss01 ~]# file -s /dev/sda1
 /dev/sda1: Linux rev 1.0 ext3 filesystem data (large files)
 

 Everything we found in the net about this problem seems to refer to 
 lustre 1.8.5. However, we would not expect such a regression in lustre 
 2. Is this actually a problem with lustre 2? Does ext4 have to be enabled 
 either at compile time or with a parameter somewhere (we found no 
 documentation about it)?
   

Lustre 2.0 did not enable ext4 by default, due to known issues.  You can 
rebuild the Lustre server,
with --enable-ext4 on the configure line, to enable it.  But if you 
are going to use 12TB LUNs,
you should either stick with v1.8.5 (stable), or pull a newer version 
from git (experimental).
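
As a rough sketch of such a rebuild (the source directory and kernel-source 
path are placeholders for your environment):

    cd lustre-2.0.0.1
    # point --with-linux at the patched Lustre kernel source tree:
    ./configure --with-linux=/usr/src/linux-lustre --enable-ext4
    make rpms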

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] need help

2011-03-15 Thread Kevin Van Maren
Ashok nulguda wrote:
 Dear All,

 How do we forcefully shut down the Lustre service on the clients and on the OST and 
 MDS servers while I/O is still in progress?

For the servers, you can just umount them.  There will not be any file 
system corruption, but files will not have the latest data -- the cache 
on the clients will not be written to disk (unless recovery happens -- 
restart the servers without having rebooted the clients).  In an 
emergency, this is normally all you have time to do before shutting down 
the system.

To unmount clients, not only can there not be any IO, you also need to 
first kill every process that has an open file on Lustre.  lsof can be 
useful here if you don't want to do a full shutdown, but in many 
environments killing non-system processes is enough.
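
As a rough sketch, with /mnt/lustre standing in for the client mount point:

    lsof /mnt/lustre          # list processes with open files on Lustre
    fuser -km /mnt/lustre     # forcibly kill them -- use with care
    umount /mnt/lustre        # the client unmount should now succeed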

Normally you'd want to shutdown all the clients, and then the servers.

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] clients gets EINTR from time to time

2011-02-24 Thread Kevin Van Maren
No, in case of an eviction or IO errors, EIO is returned to the 
application, not EINTR.

Kevin


DEGREMONT Aurelien wrote:
 Hello

  From my understanding, Lustre can return EINTR for some I/O error cases.
 I think that when a client gets evicted in the middle of one of its RPCs, 
 it can return EINTR to the caller.
 Is this can explain your issue?

 Can your verify your clients where not evicted at the same time?

 Aurélien

 Francois Chassaing a écrit :
   
 OK, thanks, that makes it clearer.
 I indeed messed up my mind (and words) between signals and error return 
 codes.
 I did understand that the write()/pwrite() system call was returning the EINTR 
 error code because it received a signal, but I supposed that the signal was 
 sent because of an error condition somewhere in the FS. 
 This is where I now think I'm wrong. 
  
 As for your questions :
 - I have to mention that I always had had this issue, and this is why I've 
 upgraded from 1.8.4 to 1.8.5, hoping this would solve it.
 - I will try to have that SA_RESTART flag set in the app... if I can find 
 where the signal handler is set.
 - How can I see whether Lustre is returning EINTR for any other reason? As I 
 said, the logs show nothing on either the MDS or the OSSs, but I didn't go through 
 examining lctl debug_kernel yet... which I'm going to do right away...

 my last question is: how can I tell which signal I am receiving? Because 
 my app doesn't say, it just dumps out the write/pwrite error code. 
 And if there is no signal handler, then it should follow the standard 
 actions (as of man 7 signal). On the other hand, my app does not stop or 
 dump core, and is not ignored, so it has to be handled in the code. Correct 
 me if I'm wrong...

 At that point, you realize that I didn't write the app, nor am I a good 
 Linux guru ;-)

 Tnaks a lot.

 weborama lineFrançois Chassaing Directeur Technique - CTO 

 - Mail Original -
 De: Ken Hornstein k...@cmf.nrl.navy.mil
 À: Francois Chassaing f...@weborama.com
 Cc: lustre-discuss@lists.lustre.org
 Envoyé: Jeudi 24 Février 2011 15h54:24 GMT +01:00 Amsterdam / Berlin / Berne 
 / Rome / Stockholm / Vienne
 Objet: Re: [Lustre-discuss] clients gets EINTR from time to time

   
 
 OK, the app is used to deal with standard disks, that is why it is not
 handling the EINTR signal properly.
 
   
 I think you're misunderstanding what a signal is in the Unix sense.

 EINTR isn't a signal; it's a return code from the write() system call
 that says, Hey, you got a signal in the middle of this write() call
 and it didn't complete.  It doesn't mean that there was an error
 writing the file; if that was happening, you'd get a (presumably
 different) error code.  Signals can be sent by the operating system,
 but those signals are things like SIGSEGV, which basically means, your
 program screwed up.  Programs can also send signals to each other,
 with kill(2) and the like.

 Now, NORMALLY systems calls like write() are interrupted by signals
 when you're writing to slow devices, like network sockets.  According
 to the signal(7) man page, disks are not normally considered slow
 devices, so I can understand the application not being used to handling
 this.  And you know, now that I think about it I'm not even sure that
 network filesystems SHOULD allow I/O system calls to be interrupted by
 signals ... I'd have to think more about it.

 I suspect what happened is that something changed between 1.8.5 and the
 previous version of Lustre that you were using that allowed some operations
 to be interruptable by signals.  Some things to try:

 - Check to see if you are, in fact, receiving a signal in your application
   and Lustre isn't returning EINTR for some other reason.
 - If you are receiving a signal, when you set the signal handler for it
   you could use the SA_RESTART flag to restart the interrupted I/O; I think
   that would make everything work like it did before.

 --Ken
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] OST threads

2011-02-24 Thread Kevin Van Maren
However, I don't think you can decrease the number of running threads.
See https://bugzilla.lustre.org/show_bug.cgi?id=22417 (and also 
https://bugzilla.lustre.org/show_bug.cgi?id=22516 )
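
For reference, a minimal sketch of the runtime tuning that is possible 
(assuming the 1.8 parameter names under ost.OSS.ost_io):

    lctl set_param ost.OSS.ost_io.threads_max=256
    lctl get_param ost.OSS.ost_io.threads_started

Lowering threads_max will not stop threads that are already running.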

Kevin


Mervini, Joseph A wrote:
 Cool! Thank you Johann.
 

 Joe Mervini
 Sandia National Laboratories
 High Performance Computing
 505.844.6770
 jame...@sandia.gov



 On Feb 24, 2011, at 11:05 AM, Johann Lombardi wrote:

   
 On Thu, Feb 24, 2011 at 10:48:32AM -0700, Mervini, Joseph A wrote:
 
 Quick question: Has runtime modification of the number of OST threads been 
 implemented in Lustre-1.8.3?
   
 Yes, see bugzilla ticket 18688. It was landed in 1.8.1.

 Cheers,
 Johann

 -- 
 Johann Lombardi
 Whamcloud, Inc.
 www.whamcloud.com

 


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Compiling Lustre 2 on SLES10

2011-02-21 Thread Kevin Van Maren
Yes, that is what Oracle had announced in the roadmap.

SLES servers are still supported on Lustre 1.8.x, but Oracle announced 
plans to not support them with Lustre 2.x.  Given the similarities 
between the RHEL6 and SLES11 kernels, I am sure someone could bring SLES 
support back when RHEL6 is supported, if enough people were willing to 
pay for it.

Kevin


Alvaro Aguilera wrote:
 Does it mean that Lustre is completely dropping server support for SLES?


 On Mon, Feb 21, 2011 at 4:58 PM, Johann Lombardi joh...@whamcloud.com 
 mailto:joh...@whamcloud.com wrote:

 On Mon, Feb 21, 2011 at 04:42:45PM +0100, Alvaro Aguilera wrote:
  inside that directory there are only files for RedHat5 and SLES11.
 
  Is SLES10 still supported?

 Yes, but only on the client side:
 http://wiki.lustre.org/index.php/Lustre_2.0#Lustre_2.0_Matrix

 Cheers,
 Johann

 --
 Johann Lombardi
 Whamcloud, Inc.
 www.whamcloud.com http://www.whamcloud.com


 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Kernel Panic error after lustre 2.0 installation

2011-02-17 Thread Kevin Van Maren
Yep.  All you have to do is rebuild the driver for the Lustre kernel.

First, bring the system back up with the non-Lustre kernel.



See the bottom of the readme:

# cd /usr/src/linux/drivers/scsi/arcmsr
(suppose /usr/src/linux is the soft-link for 
/usr/src/kernel/2.6.23.1-42.fc8-i386)
# make -C /lib/modules/`uname -r`/build CONFIG_SCSI_ARCMSR=m 
SUBDIRS=$PWD modules
# insmod arcmsr.ko

Except instead of uname -r substitute the lustre kernel's 'uname -r', 
as you want to build for the Lustre kernel.  Be sure you have the Lustre 
kernel-devel RPM installed.

Note that the insmod step will not work (you already have the driver loaded 
for the running kernel, and the module you built for the Lustre kernel will 
not load into the running kernel).  You will need to rebuild the initrd for 
the Lustre kernel (see the other instructions in the readme, using the Lustre 
kernel).
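
Roughly, with the Lustre kernel's version string as a placeholder (use the 
Lustre kernel's 'uname -r'):

    LK=2.6.18-XXX_lustre            # placeholder: the Lustre kernel version
    cd /usr/src/linux/drivers/scsi/arcmsr
    make -C /lib/modules/$LK/build CONFIG_SCSI_ARCMSR=m SUBDIRS=$PWD modules
    cp arcmsr.ko /lib/modules/$LK/kernel/drivers/scsi/
    depmod -a $LK
    mkinitrd -f --with=arcmsr /boot/initrd-$LK.img $LK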

Kevin


Arya Mazaheri wrote:
 The driver name is arcmsr.ko and I extracted it from driver.img 
 included in RAID controller's CD. The following text file may clarify 
 better:

 ftp://areca.starline.de/RaidCards/AP_Drivers/Linux/DRIVER/RedHat/FedoraCore/Redhat-Fedora-core8/1.20.0X.15/Intel/readme.txt

 Please tell me, if you need more information about this issue...

 On Thu, Feb 17, 2011 at 11:33 PM, Brian J. Murrell 
 br...@whamcloud.com mailto:br...@whamcloud.com wrote:

 On Thu, 2011-02-17 at 23:26 +0330, Arya Mazaheri wrote:
  Hi there,

 Hi,

  Unable to access resume device (LABEL=SWAP-sda3)
  mount: could not find filesystem 'dev/root'
  setuproot: moving /dev failed: No such file or directory
  setuproot: error mounting /proc: No such file or directory
  setuproot: error mounting /sys: No such file or directory
  swirchroot: mount failed: No such file or directory
  Kernel Panic - not syncing: Attempted to kill init!
 
  I have no problem with the original kernel installed by centos. I
  guessed this may be related to RAID controller card driver which may
  not loaded by the patched lustre kernel.

 That seems like a reasonable conclusion given the information
 available.

  so I have added the driver into the initrd.img file.

 Where did you get the driver from?  What is the name of the driver?

  But it didn't solve the problem.

 Depending on where it came from, yes, it might not.

  Should I install the lustre by building the source?

 That may be required, but not necessarily required.  We need more
 information.

 b.



 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 mailto:Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss


 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre client error

2011-02-17 Thread Kevin Van Maren
To figure out which OST is which, use e2label /dev/sdX (or e2label 
/dev/mapper/mpath7) which will print the OST index in hex.
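
For example, a quick loop over the multipath devices (the device glob is a 
placeholder for your setup):

    for dev in /dev/mapper/mpath*; do
        echo "$dev -> $(e2label $dev)"
    done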

If clients run out of space, but there is space left, see Bug 22755 
(mostly fixed in Lustre 1.8.4).

Lustre assigns the OST index at file creation time.  Lustre will avoid 
full OSTs, but once a file is created any growth must be accommodated by 
the initial OST assignment(s).  Deactivating the OST on the MDS will 
prevent new allocations, but they shouldn't be happening anyway.

You can copy/rename some large files to put them on another OST which 
will free up space on the full OST (move will not allocate new space, 
just change the directory name).
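
As a rough sketch, using the OST UUID and mount point from your 'lfs df' 
output (adjust as needed):

    # List the files with objects on the full OST:
    lfs find --obd reshpcfs-OST0007_UUID /reshpcfs > /tmp/ost7_files
    # Copy then rename each one; the copy allocates new objects on other
    # OSTs, the rename keeps the original path:
    while read f; do
        cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
    done < /tmp/ost7_files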

Kevin



Jagga Soorma wrote:
 This OST is 100% now with only 12GB remaining and something is 
 actively writing to this volume.  What would be the appropriate thing 
 to do in this scenario?  If I set this to read only on the mds then 
 some of my clients start hanging up.

 Should I be running lfs find -O OST_UID /lustre and then move the 
 files out of this filesystem and re-add them back?  But then there is 
 no guarantee that they will not be written to this specific OST.

 Any help would be greately appreciated.

 Thanks,
 -J

 On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma jagg...@gmail.com 
 mailto:jagg...@gmail.com wrote:

 I might be looking at the wrong OST.  What is the best way to map
 the actual /dev/mapper/mpath[X] to what OST ID is used for that
 volume?

 Thanks,
 -J


 On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma jagg...@gmail.com
 mailto:jagg...@gmail.com wrote:

 Also, it looks like the client is reporting a different %used
 compared to the oss server itself:

 client:
 reshpc101:~ # lfs df -h | grep -i 0007
 reshpcfs-OST0007_UUID  2.0T  1.7T202.7G   84%
 /reshpcfs[OST:7]

 oss:
 /dev/mapper/mpath72.0T  1.9T   40G  98%
 /gnet/lustre/oss02/mpath7

 Here is how the data seems to be distributed on one of the OSS's:
 --
 /dev/mapper/mpath52.0T  1.2T  688G  65%
 /gnet/lustre/oss02/mpath5
 /dev/mapper/mpath62.0T  1.7T  224G  89%
 /gnet/lustre/oss02/mpath6
 /dev/mapper/mpath72.0T  1.9T   41G  98%
 /gnet/lustre/oss02/mpath7
 /dev/mapper/mpath82.0T  1.3T  671G  65%
 /gnet/lustre/oss02/mpath8
 /dev/mapper/mpath92.0T  1.3T  634G  67%
 /gnet/lustre/oss02/mpath9
 --

 -J


 On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma
 jagg...@gmail.com mailto:jagg...@gmail.com wrote:

 I did deactivate this OST on the MDS server.  So how would
 I deal with a OST filling up?  The OST's don't seem to be
 filling up evenly either.  How does lustre handle a OST
 that is at 100%?  Would it not use this specific OST for
 writes if there are other OST available with capacity? 

 Thanks,
 -J


 On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger
 adil...@whamcloud.com mailto:adil...@whamcloud.com wrote:

 On 2011-02-15, at 12:20, Cliff White wrote:
  Client situation depends on where you deactivated
 the OST - if you deactivate on the MDS only, clients
 should be able to read.
 
  What is best to do when an OST fills up really
 depends on what else you are doing at the time, and
 how much control you have over what the clients are
 doing and other things.  If you can solve the space
 issue with a quick rm -rf, best to leave it online,
 likewise if all your clients are trying to bang on it
 and failing, best to turn things off. YMMV

 In theory, with 1.8 the full OST should be skipped for
 new object allocations, but this is not robust in the
 face of e.g. a single very large file being written to
 the OST that takes it from average usage to being full.

  On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma
 jagg...@gmail.com mailto:jagg...@gmail.com wrote:
  Hi Guys,
 
  One of my clients got a hung lustre mount this
 morning and I saw the following errors in my logs:
 
  --
  ..snip..
  Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0:
 an error occurred while communicating with
 10.0.250.47@o2ib3. The ost_write operation failed with -28
  Feb 15 09:38:07 reshpc116 kernel: LustreError:
 Skipped 4755836 previous similar messages
  Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0:
 an 

Re: [Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

2011-01-27 Thread Kevin Van Maren
Normally if you are having a problem with write BW, you need to futz 
with the switch.  If you were having
problems with read BW, you need to futz with the server's config (xmit 
hash policy is the usual culprit).

Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your 
switch?  I normally use 802.3ad mode,
assuming your switch supports link aggregation.


I was bonding 2x1Gb links for Lustre back in 2004.  That was before 
BOND_XMIT_POLICY_LAYER34
was in the kernel, so I had to hack the bond xmit hash (with multiple 
NICs standard, layer2 hashing does not
produce a uniform distribution, and can't work if going through a router).

Any one connection (socket or node/node connection) will use only one 
gigabit link.  While it is possible
to use two links using round-robin, that normally only helps for client 
reads (server can't choose which link to
receive data, the switch picks that), and has the serious downside of 
out-of-order packets on the TCP stream.

[If you want clients to have better client bandwidth for a single file, 
change your default stripe count to 2, so it
will hit two different servers.]
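
For example (directory path illustrative), to make new files in a directory
default to a stripe count of 2:

   client# lfs setstripe -c 2 /mnt/lustre/mydir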

Kevin


David Merhar wrote:
 Sorry - little b all the way around.

 We're limited to 1Gb per OST.

 djm



 On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:

   
 I guess you have two gigabit nics bonded in mode 6 and not two 1GB  
 nics?
 (B-Bytes, b-bits) The max aggregate throughput could be about 200MBps
 out of the 2 bonded nics. I think the mode 0 bonding works only with
 cisco etherchannel or something similar on the switch side. Same with
 the FC connection, its 4Gbps (not 4GBps) or about 400-500 MBps max
 throughout. Maybe you could also see the max read and write  
 capabilities
 of the raid controller other than just the network. When testing with
 dd, some of the data remains as dirty data till its flushed into the
 disk. I think the default background ratio is 10% for rhel5 which  
 would
 be sizable if your oss have lots of ram. There is chance of lockup of
 the oss once it hits the dirty_ratio limit, which is 40% by default.  
 So a
 bit more aggressive flush to disk by lowering the background_ratio  
 and a
 bit more headroom before it hits the dirty_ratio is generally  
 desirable
 if your raid controller could keep up with it. So with your current
 setup, i guess you could get a max of 400MBps out of both OSS's if  
 they
 both have two 1Gb nics in them. Maybe if you have one of the switches
 from Dell that has 4 10Gb ports in them (their powerconnect 6248),  
 10Gb
 nics for your OSS's might be a cheaper way to increase the aggregate
 performance. I think over 1GBps from a client is possible in cases  
 where
 you use infiniband and rdma to deliver data.


 David Merhar wrote:
 
 Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of
 write throughput each.

 Our setup:
 2 OSS serving 1 OST each
 Lustre 1.8.5
 RHEL 5.4
 New Dell M610's blade servers with plenty of CPU and RAM
 All SAN fibre connections are at least 4GB

 Some notes:
 - A direct write (dd) from a single OSS to the OST gets 4GB, the  
 OSS's
 fibre wire speed.
 - A single client will get 2GB of lustre write speed, the client's
 ethernet wire speed.
 - We've tried bond mode 6 and 0 on all systems.  With mode 6 we will
 see both NICs on both OSSs receiving data.
 - We've tried multiple OSTs per OSS.

 But 2 clients writing a file will get 2GB of total bandwidth to the
 filesystems.  We have been unable to isolate any particular resource
 bottleneck.  None of the systems (MDS, OSS, or client) seem to be
 working very hard.

 The 1GB per OSS threshold is so consistent, that it almost appears by
 design - and hopefully we're missing something obvious.

 Any advice?

 Thanks.

 djm



 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Support for 2.6.32 kernel.org Kernel in Lustre 1.8.5

2010-12-08 Thread Kevin Van Maren
Client support through 2.6.32 (vanilla) is in v1.8.5.  Looks like one 
page missed getting updated.

http://wiki.lustre.org/index.php/Lustre_Release_Information#Lustre_Support_Matrix

Kevin


Nirmal Seenu wrote:
 I have a quick question: are patchless clients for kernel.org kernel 
 2.6.32 officially supported under Lustre 1.8.5, or do I need to include any 
 patches?

 In the lustre source tree, lustre/ChangeLog says that 2.6.32 is supported, while 
 the wiki page (http://wiki.lustre.org/index.php/Change_Log_1.8) says only 
 kernels up to 2.6.30 are officially supported for patchless clients.

 Note: I am able to build the patchless clients cleanly with 2.6.32.20 kernel 
 and OFED 1.5.2.

 Thanks
 Nirmal
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] manual OST failover for maintenance work?

2010-12-07 Thread Kevin Van Maren
Cliff White wrote:
 On 12/06/2010 09:57 AM, Adeyemi Adesanya wrote:
   
 Hi.

 We have pairs of OSS nodes hooked up to shared storage arrays
 containing OSTs but we have not enabled any failover settings yet. Now
 we need to perform maintenance work on an OSS and we would like to
 minimize Lustre downtime. Can I use tunefs.lustre to specify the OSS
 failover NID for an existing OST? I assume i'll have to take the OST
 offline to make this change. Will clients that have Lustre mounted
 pick up this change or will all clients have to remount? I should
 mention that we are running Lustre 1.8.2.
 


 Yes, see the Lustre Manual for details.
 cliffw
   

Should be something like this for an OST:
# tunefs.lustre --writeconf --erase-params --mgsnode=10.0@o2ib 
--mgsnode=10.0@o2ib --param=failover.node=10.0@o2ib /dev/ost0

Do MGS first (if not already done and it will have failover).  Dedicated 
mgs should not have to specify mgs, just the failover.
For MDT, would probably have to also have 
--param=mdt.group_upcall=/usr/sbin/l_getgroups

Note that you must add the failover NID (ie, do the tunefs and the first 
mount) on the _primary_ (non-failover) node.

Lustre machines get the NID information for MDT/OST devices from the MGS 
at mount time.
There is no callback mechanism to notify of changes to the NIDs, so yes, 
clients would
have to re-mount the file system to be able to use the failover NIDs.

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Delete ost

2010-11-22 Thread Kevin Van Maren
Wang Yibin wrote:
 Hello,

  On 2010-11-19, at 3:21 AM, Thomas Johansson wrote:

 I am not sure I understand - Do you have multiple filesystems
 sharing the
 same MGS?

 Yes 5 filesystems on 4 OSS:s and 2 MDS in active/passive failover.
 Some 100 TB of space in total.

  Probably you misunderstood me. You seem to be using 1
  filesystem (MDS x2 / OSS x4) with 5 clients.
  Making 5 lustre filesystems out of 4 OSS / 2 MDS is mission impossible.

No, while it is not often done, there is nothing to prevent 5 Lustre
file systems from running on 4 OSS nodes and 2 MDS nodes.
In addition to the MGS, each file system needs one MDT and 1 or more
OSTs. An OSS can serve up OSTs for multiple file systems, and an MDS
node can serve up MDTs for multiple file systems (and a node could even
be both an MDS and OSS at the same time).

Now, if there were a separate MGS for each file system, then it would be
a different story... each node can really only serve up OSTs or MDTs for
a single MGS.

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover

2010-11-21 Thread Kevin Van Maren
Adrian Ulrich wrote:
 Hi Kevin,

   
 But you specified that as a failover node:
   # tunefs.lustre --erase-params 
 --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10
 

 Well: First i was just running

 # tunefs.lustre --param mdt.quota_type=ug /dev/md10

 and this alone was enough to break it.
   

Not sure.

 did you specify both sets on your mkfs command line?
 

 The initial installation was done / dictated by the swiss branch of
 an (no longer existing) three-letter company. This command was used
 to create the filesystem on the MDS

 # FS_NAME=lustre1
 # MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0
 # MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0
 # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} 
 --failnode=${MGS_2} /dev/md10
   

I haven't done combined mdt/mgs for a while, so I can't recall if you 
have to specify the mgs NIDs for the MDT when it is colocated with the 
MGS, but I think the command should have been more like:

# mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} 
--mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10
with the mkfs/first mount on MGS_1.

As I mentioned, you would not normally specify the mkfs/first-mount NIDs 
as failover parameters, as they are added automatically by Lustre.

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LBUG on lustre 1.8.0

2010-11-21 Thread Kevin Van Maren
Sure, but I think for engineering to make progress on this bug, they  
are going to want a crash dump.  If you can enable crash dumps and  
panic on lbug (and if HA, increase dead timeout so it can complete the  
dump before being shot in the head) it would provide more info for the  
bug report.

That being said, there are quite a few other bugs that have been fixed  
since 1.8.0, so you really should upgrade ASAP to 1.8.4.

Kevin


On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote:

 We had a LBUG several days ago on our lustre 1.8.0. One OSS reported

 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request())
  ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
 kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack())
 showing stack for process 24669
 ..

 I google for this, and find little information about it. It seems to
 be a race condition on OSS, right? Should I open a bugzilla for this
 LBUG?
 Thanks.
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LBUG on lustre 1.8.0

2010-11-21 Thread Kevin Van Maren
Larry wrote:
  We add the options libcfs libcfs_panic_on_lbug=1 in modprobe.conf to
  make the server kernel panic as soon as the LBUG happens. Is there some way
  to make the server die a few seconds after the LBUG? We are also
  puzzled by the messages lost when the LBUG happened.
   

The messages should have gone to the console just fine (hopefully you 
are logging a serial console).
If you are talking about /var/log/messages, then yes, it will be missing 
the final output as the
messages don't have time to get written to disk on a kernel panic.

Kevin


 On Mon, Nov 22, 2010 at 10:42 AM, Kevin Van Maren
 kevin.van.ma...@oracle.com wrote:
   
 Sure, but I think for engineering to make progress on this bug, they are
 going to want a crash dump.  If you can enable crash dumps and panic on lbug
 (and if HA, increase dead timeout so it can complete the dump before being
 shot in the head) it would provide more info for the bug report.

 That being said, there are quite a few other bugs that have been fixed since
 1.8.0, so you really should upgrade ASAP to 1.8.4.

 Kevin


 On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote:

 
 We had a LBUG several days ago on our lustre 1.8.0. One OSS reported

 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request())
  ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
 kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack())
 showing stack for process 24669
 ..

 I google for this, and find little information about it. It seems to
 be a race condition on OSS, right? Should I open a bugzilla for this
 LBUG?
 Thanks.
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] [Fwd: Re: Broken client]

2010-11-19 Thread Kevin Van Maren
Not sure.  Could be some clients had data in their cache, and others  
hit the error when they tried to get it from the OST.

Sorry I misunderstood you -- I thought you had already run fsck on the  
OSTs.

Kevin


On Nov 19, 2010, at 9:41 AM, Herbert Fruchtl herbert.fruc...@st-andrews.ac.uk 
  wrote:

 Thanks guys, Looks like unmounting the unhealthy OST filesystem  
 and running an
 fsck on it (which found several errors) solved the problem! I still  
 don't
 understand why it looked different from different clients...

 Cheers,

   Herbert

 Oleg Drokin wrote:
 Hello!

  So are there any other compplaints on the OSS node when you mount  
 that OST?
  Did you try to run e2fsck on the ost disk itself (while  
 unmounted)? I assume one of the possible problems is just on0disk  
 fs corruptions
  (and it might show unhealthy due to that right after mount too).

 Bye,
Oleg
 On Nov 18, 2010, at 1:47 PM, Herbert Fruchtl wrote:

 Sorry, I had meant to cc this to the list.

 Herbert

 From: Herbert Fruchtl herbert.fruc...@st-andrews.ac.uk
 Date: November 18, 2010 12:56:53 PM EST
 To: Kevin Van Maren kevin.van.ma...@oracle.com
 Subject: Re: [Lustre-discuss] Broken client


 Hi Kevin,

  That didn't change anything. Unmounting one of the OSTs hung (yes,  
 with an LBUG), and I did a hard reboot. It came up again, and the  
 status is as before: on the MDT server, I can see all files (well,  
 I assume it's all); on the client in question some files appear  
 broken. The OST is still not healthy. I am running another  
 lfsck, without much hope. Here's the LBUG:

  Nov 18 17:05:16 oss1-fs kernel: LustreError: 8125:0: 
  (lprocfs_status.c:865:lprocfs_free_client_stats()) LBUG

 Herbert

 Kevin Van Maren wrote:
 Reboot the server with the unhealthy OST.
 If you look at the logs, there is likely an LBUG that is causing  
 the problems.
 Kevin
 On Nov 18, 2010, at 9:51 AM, Herbert Fruchtl 
 herbert.fruc...@st-andrews.ac.uk 
  wrote:
 It looks like you may have corruption on the mdt or an ost,  
 where the
 objects on an OST can't be found for the directory entry. Have  
 you
 had a crash recently or run Lustre fsck? You might need to do  
 fsck and
 delete (unlink) the broken files.

 The files do exist (I can see them on the mdt server) and I  
 don't want to delete
 them. There was a crash lately, and I have run an lfsck  
 afterwards (repeatedly,
  actually).

 I suppose it's also possible you're seeing fallout from an  
 earlier LBUG or
 something. Try 'cat /proc/fs/lustre/health_check' on all the  
 servers.

 There seems to be a problem:
 [r...@master ~]# cat /proc/fs/lustre/health_check
 healthy
 [r...@master ~]# ssh oss1 'cat /proc/fs/lustre/health_check'
 device home-OST0005 reported unhealthy
 NOT HEALTHY
 [r...@master ~]# ssh oss2 'cat /proc/fs/lustre/health_check'
 healthy
 [r...@master ~]# ssh oss3 'cat /proc/fs/lustre/health_check'
 healthy

 What do I do about the unhealthy OST?

 Herbert
 -- 
 Herbert Fruchtl
 Senior Scientific Computing Officer
 School of Chemistry, School of Mathematics and Statistics
 University of St Andrews
 -- 
 The University of St Andrews is a charity registered in Scotland:
 No SC013532
 -- 
 Herbert Fruchtl
 Senior Scientific Computing Officer
 School of Chemistry, School of Mathematics and Statistics
 University of St Andrews
 --
 The University of St Andrews is a charity registered in Scotland:
 No SC013532



 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss


 -- 
 Herbert Fruchtl
 Senior Scientific Computing Officer
 School of Chemistry, School of Mathematics and Statistics
 University of St Andrews
 --
 The University of St Andrews is a charity registered in Scotland:
 No SC013532
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Broken client

2010-11-18 Thread Kevin Van Maren
Wang Yibin wrote:
 Hello,

  On 2010-11-18, at 10:03 PM, Herbert Fruchtl wrote:

 I was wrong about only one client having problems. It seems to
 be all of them, except the mds server (see below), so it is a
 problem of the filesystem (not the client) after all.

It looks like you may have corruption on the mdt or an ost, where the
objects on an OST can't be found for the directory entry. Have you
had a crash recently or run Lustre fsck? You might need to do fsck and
delete (unlink) the broken files.

I suppose it's also possible you're seeing fallout from an earlier LBUG or
something. Try 'cat /proc/fs/lustre/health_check' on all the servers.

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Network problems

2010-10-28 Thread Kevin Van Maren
Arne Brutschy wrote:
 Hi all,

 we're using Lustre 1.8.3 on a gigabit network. We have 4 osts on 2 oss
 and a single mgs. We are serving the users' homedirs (mostly small
 files) for 64 clients on this network. It now happened for the third
 time that the cluster went down: either the oss or the mgs block, and
 nobody can access the lustre share anymore.

 Looking at the logs, I see lots of connectivity errors:
 
 LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle 
 cmd=250 rc=-16
 LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 
 previous similar messages
 LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ 
 processing error (-16)  r...@f5ae642c x1344336331331741/t0 
 o250-6e1c6cb5-564f-49b0-a01e-e7e460542...@net_0x20ac1_uuid:0/0 lens 
 368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0
 LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) 
 Skipped 3 previous similar messages
 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ 
 Request x1348047450260821 sent from lustre-MDT-mdc-f3aa2e00 to NID 
 0...@lo 30s ago has timed out (30s prior to deadline).
   r...@e66d1e00 x1348047450260821/t0 
 o38-lustre-mdt_u...@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 
 1288258632 ref 1 fl Rpc:N/0/0 rc 0/0
 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 
 previous similar messages
 Lustre: MGS: haven't heard from client 
 03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255@tcp) in 228 seconds. 
 I think it's dead, and I am evicting it.
 LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 
 reading connection request from 10.255.255.199
 Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: 
 fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
 Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49 
 previous similar messages
 LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error 
 -104 reading HELLO from 10.255.255.199
 LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 
 2 previous similar messages
 LustreError: 11b-b: Connection to 10.255.255@tcp at host 
 10.255.255.199 on port 988 was reset: is it running a compatible version of 
 Lustre and is 10.255.255@tcp one of its NIDs?
 Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp 
 eb3ff200 already connecting
 Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse 
 reconnection from fa602b20-b24c-bbcd-7003-b3b9bf702...@10.255.255.199@tcp to 
 0xeb3ff200; still busy with 2 active RPCs
   

It looks like the server threads are spending a long time processing the 
request.  If you look at the client logs for 10.255.255.199 you will 
likely see that it thinks the server died and tried to failover.  The 
server, when it finally got around to processing the request, noticed 
that the client was no longer there, as it had given up on the server.  
The server, for sanity, won't allow the client to reconnect until the 
outstanding transactions have completed (so the question is why are they 
taking so long).

Are you seeing any slow messages on the servers?  There are lots of 
reasons server threads could be slow.  If /proc/sys/vm/zone_reclaim_mode
is 1, try setting it to 0.
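
For example:

   # cat /proc/sys/vm/zone_reclaim_mode
   # echo 0 > /proc/sys/vm/zone_reclaim_mode

(add vm.zone_reclaim_mode = 0 to /etc/sysctl.conf to keep the setting across reboots)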

You might want to try the patch in Bug 23826 which I found useful in 
tracking how long the server thread was processing the request, rather 
than just the IO phase.

 Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 
 previous similar message
 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ 
 Request x1348047450261066 sent from lustre-MDT-mdc-f3aa2e00 to NID 
 0...@lo 30s ago has timed out (30s prior to deadline).
   r...@ce186400 x1348047450261066/t0 
 o38-lustre-mdt_u...@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 
 1288259252 ref 1 fl Rpc:N/0/0 rc 0/0
 Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 
 previous similar messages
 Lustre: There was an unexpected network error while writing to 
 10.255.255.221: -110.
 Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp 
 d8637600 already connecting
 Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse 
 reconnection from 7ee2fe58-3fab-c39a-8adb-c356d1bdc...@10.255.255.209@tcp to 
 0xd8637600; still busy with 1 active RPCs
 LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle 
 cmd=250 rc=-16
 LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 
 previous similar messages
 LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ 
 processing error (-16)  r...@f5ae6c2c x1344329600678410/t0 
 o250-7ee2fe58-3fab-c39a-8adb-c356d1bdc...@net_0x20ad1_uuid:0/0 lens 
 368/264 e 0 to 0 dl 1288259427 

Re: [Lustre-discuss] 1.8 quotas

2010-10-22 Thread Kevin Van Maren
David Dillow wrote:
 On Fri, 2010-10-22 at 22:56 +0800, Fan Yong wrote:
   
 On 10/22/10 9:37 PM, Jason Hill wrote:
 
 Folks,

 Not having to deal with quotas on our scratch filesystems in the past, I'm
 puzzled on why we're seeing messages like the following:

 Oct 22 09:29:00 widow-oss3c2 kernel: kernel: Lustre: widow3-OST00b1: slow 
 quota init 35s due to heavy IO load

 We're (I think) not doing quotas. 
   
 [ ... ]
   
 So, the question is - if we see messages like slow quota init, are quotas
 being calculated in the background? And as a followup - how do we turn them
 off?
   

   
 No. I think you are misguided by the message slow quota init 35s due to 
 heavy IO load, which does not mean recalculating (initial calculating) 
 quota in the background. In fact, such message is printed out before 
 obdfilter write, at such point, the OST tries to acquire enough quota 
 for this write operation. It will check locally whether the remaining 
 quota related with the uid/gid (for this OST object) is enough or not, 
 if not, the quota slave on this OST will acquire more quota from quota 
 master on MDS. This process may take a long time on a heavily loaded 
 system, especially when the remaining quota on the quota master (MDS) is 
 also very limited. The message you saw just shows that. There is no good 
 way to disable these messages so long as quota is set on this uid/gid.
 

 This is the heart of Jason's question -- he has done nothing to his
 knowledge to enable quotas at all, so why is he getting a message about
 quotas? Are they actually enabled on the FS, and how would he be able to
 verify that?

 Or does it always process quotas, even if they are not enabled?
   


That message, from lustre/obdfilter/filter_io_26.c, is the result of the 
thread taking 35 seconds 
from when it entered filter_commitrw_write() until after it called 
lquota_chkquota() to check the quota.

However, it is certainly plausible that the thread was delayed because 
of something other than quotas,
such as an allocation (eg, it could have been stuck in filter_iobuf_get).
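
One way to check whether quotas were ever enabled on the targets -- a sketch,
not the only way, with the device path illustrative -- is to dump the stored
parameters with tunefs.lustre and look for a quota_type setting:

   # tunefs.lustre --dryrun /dev/<mdt-or-ost-device>

If the printed Parameters line contains nothing like mdt.quota_type=ug (or
ost.quota_type=ug on the OSTs), quotas were never turned on for that target.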

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] bad csum errors

2010-09-28 Thread Kevin Van Maren
https://bugzilla.lustre.org/show_bug.cgi?id=11742

Kevin


John White wrote:
 Hello Folks,
   Recently we've had a fair number of messages akin to the following 
 coming from out OSS syslog:
 n0004: LustreError: 168-f: lrc-OST0002: BAD WRITE CHECKSUM: changed in 
 transit before arrival at OST from 12345-10.4.8@o2ib inum 
 1409775/2324736913 object 1771080/0 extent [401408-2809855]
 n0004: LustreError: Skipped 13 previous similar messages
 n0004: LustreError: 10839:0:(ost_handler.c:1169:ost_brw_write()) client csum 
 ae09a542, original server csum cfb6ab4b, server csum now cfb6ab4b

 There appear to be no specific clients, OSSs or OSTs in common.  We'll 
 commonly get a block of messages concerning one OST w/ different clients 
 involved and then move on to another OST.  As such, I'm doubting this is a 
 memory issue.  Previous mails on this list mention MMAP, but there doesn't 
 seem to be any mention in these messages.  Ideas?

 
 John White
 High Performance Computing Services (HPCS)
 (510) 486-7307
 One Cyclotron Rd, MS: 50B-3209C
 Lawrence Berkeley National Lab
 Berkeley, CA 94720

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Profiling data

2010-09-28 Thread Kevin Van Maren
Maybe something like RobinHood?


On Sep 28, 2010, at 2:41 PM, David Noriega tsk...@my.utsa.edu wrote:

 This question isn't really about Lustre, but file system
 administration. I was wondering what tools exist, particularly
 anything free/open source, that can scan for old files and either
 report to the admin or user that said files are say 1yr old, please
 archive them or delete them. Also any tools that can profile file
 types, such as to check if someone is keeping their mp3 library on our
 server.

 Thanks
 David

 -- 
 Personally, I liked the university. They gave us money and facilities,
 we didn't have to produce anything! You've never been out of college!
 You don't know what it's like out there! I've worked in the private
 sector. They expect results. -Ray Ghostbusters
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre 1.8.4 with new kernel 2.6.18-194.11.4

2010-09-21 Thread Kevin Van Maren
https://bugzilla.lustre.org/show_bug.cgi?id=22514

Have you tried the 1.8.4 client on the stock kernel?

Kevin


Mike Hanby wrote:
 Are there any plans to build new Lustre 1.8.4 patched kernel packages for EL5 
 kernel 2.6.18-194.11.4

 This kernel has the patch that prevents the much talked about privilege 
 escalation CVE-2010-3081:
 https://rhn.redhat.com/errata/RHSA-2010-0704.html

 Regards,

 Mike


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

2010-09-20 Thread Kevin Van Maren
Bernd Schubert wrote:
 Hello Cory,

 On 09/17/2010 11:31 PM, Cory Spitz wrote:
   
 Hi, Bernd.

 On 09/17/2010 02:48 PM, Bernd Schubert wrote:
 
 On Friday, September 17, 2010, Andreas Dilger wrote:
   
 On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
 
 We're trying to architect a Lustre setup for our group, and want to
 leverage our available resources. In doing so, we've come to consider
 multi-purposing several hosts, so that they'll function simultaneously
 as MDS  OSS.
   
 You can't do this and expect recovery to work in a robust manner.  The
 reason is that the MDS is a client of the OSS, and if they are both on the
 same node that crashes, the OSS will wait for the MDS client to
 reconnect and will time out recovery of the real clients.
 
 Well, that is some kind of design problem. Even on separate nodes it can 
 easily happen, that both MDS and OSS fail, for example power outage of the 
 storage rack. In my experience situations like that happen frequently...

   
 I think that just argues that the MDS should be on a separate UPS.
 

Or dual-redundant UPS devices driving all critical infrastructure.  
Redundant power supplies
are the norm for server-class hardware, and they should be cabled to 
different circuits (which
each need to be sized to sustain the maximum power).

 well, there is not only a single reason. Next hardware issue is that
 maybe an IB switch fails. 

Sure, but that's also easy to address (in theory): put OSS nodes on 
different leaf switches than
MDS nodes, and put the failover pairs on different switches as well.

In practice, IB switches probably do not fail often enough to worry 
about recovery glitches,
especially if they have redundant power, but I certainly recommend 
failover partners are on
different switch chips so that in case of a failure it is still possible 
to get the system up.

I would also recommend using bonded network interfaces to avoid 
cable-failure issues (ie,
connect both OSS nodes to both of the leaf switches, rather than one to 
each), but there are
some outstanding issues with Lustre on IB bonding (patches in bugzilla), 
and of course
multipath to disk (loss of connectivity to disk was mentioned at LUG as 
one of the
biggest causes of Lustre issues).  In general it is easier to have 
redundant cables than to
ensure your HA package properly monitors cable status and does a 
failover when required.

 And then have also seen cascading Lustre
 failures. It starts with an LBUG on the OSS, which triggers another
 problem on the MDS...
   
Yes, that's why bugs are fixed.  panic_on_lbug may help stop the problem 
before it spreads,
depending on the issue.
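(As mentioned elsewhere on this list, that is enabled with a module option on 
the servers, e.g. in /etc/modprobe.conf:

   options libcfs libcfs_panic_on_lbug=1

combined with a working crash-dump setup so the panic actually yields a dump.)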

 Also, for us this actually will become a real problem, which cannot be
 easily solved. So this issue will become a DDN priority.


 Cheers,
 Bernd

 --
 Bernd Schubert
 DataDirect Networks

   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Question about adaptive timeouts, not sending early reply

2010-09-18 Thread Kevin Van Maren
I believe this message says that the request timeout on this 
transaction is 42s, but when Lustre
went to ask for more time based on the current AT service estimate, 
it came up with 30s.
Since 30s is < 42s, it could not ask for more time.

Kevin


Thomas Roth wrote:
 Hi all,

 I'm trying to understand MDT logs and adaptive timeouts. After upgrade
 to 1.8.4 and while users believed Lustre to be still in maintenance (=
 no activity), the MDT log just shows

 Lustre: 19823:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@
 Couldn't add any time (42/30), not sending early reply

 Now, for historical reasons of running on a very shaky network, we load
 the lustre module with

 options ptlrpc at_max=6000
 options ptlrpc at_history=6000
 options ptlrpc at_early_margin=50

 Right now however, the MDT reports:

 lxmds:~# lctl get_param -n mdt.MDS.mds.timeouts
 service : cur  30  worst  76 (at 1284734311, 0d19h33m39s ago)  30 30  30  30

 Reading the manual on adaptive timeouts again, I conclude that if the
 current estimate for timeout is 30 sec, the MDT is indeed hard pressed
 to send an early reply 50 sec before that timeout occurs. The log
 messages states something of the like, (42/30).
 So, is my assessment correct? Are these log messages just due to the
 stupid at_early_margin setting?

 Regards,
 Thomas
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] 1.8.4 and write-through cache

2010-09-16 Thread Kevin Van Maren
Stu Midgley wrote:
 Afternoon

 I upgraded our oss's from 1.8.3 to 1.8.4 on Saturday (due to
 https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a
 great deal of pain.

 We have 30 oss's of multiple vintages.  The basic difference between them is

   * md on first 20 nodes
   * 3ware 9650SE ML12 on last 10 nodes

 After the upgrade to 1.8.4 we were seeing terrible throughput on the
 nodes with 3ware cards (and only the nodes with 3ware cards).  This
  was typified by seeing the block device being 100% utilised (iostat),
 doing about 100r/s and 400kb/s and all the ost_io threads in D state
 (no writes).  They would be in this state for 10mins and then suddenly
 awake and start pushing data again.  1-2 mins later, they would lock
 up again.

 The oss's were dumping stacks all over the place, crawling along and
 generally making our lustrefs unuseable.
   

Would you post a few of the stack traces?  Presumably these were driven 
by watchdog timeouts,
but it would help to know where they were getting stuck.

 After trying different kernels, raid card drivers, changing write back
 policy on the raid cards etc. the solution was to

 lctl set_param obdfilter.*.writethrough_cache_enable=0
 lctl set_param obdfilter.*.read_cache_enable=0

 on all the nodes with the 3ware cards.

 Has anyone else seen this?  I am completely baffled as to why it only
 affects our nodes with 3ware cards.

 These nodes were working very well under 1.8.3...


   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Oss Error and 0 byte files

2010-09-09 Thread Kevin Van Maren
I believe grant leak is still possible with 1.8.4, but many of the holes 
are plugged.

Kevin


Gabriele Paciucci wrote:
 the bug 22755 is fixed in 1.8.4

 http://wiki.lustre.org/index.php/Use:Change_Log_1.8




 On 09/09/2010 11:55 AM, Gianluca Tresoldi wrote:
  Yes, client gets ENOSPC, I see now.

 Anyway: ThankYou Very Much for your reply ;)


 On 09/08/10 17:29, Kevin Van Maren wrote:
 It might be related to bug 22755, but there the client gets ENOSPC


  On Sep 8, 2010, at 8:02 AM, Gianluca Tresoldi gianluca.treso...@tuttogratis.com wrote:

 Hello everyone

 I've an installation with Lustre 1.8.2, Centos 5, x86_64 and  I 
 encountered this problem:

  After several months of smooth operation, clients began to write 
  empty files without any log errors; from their point of view the writes were 
  successful.

 OSS wrote, in their log, several lines like:
 Sep  8 12:40:31 tgoss-0200 kernel: LustreError: 
 5816:0:(filter_io.c:183:filter_grant_space_left()) lfs01-OST: 
 cli 20d94382-3300-f12e-65d1-c0f1743e1e20/8106a4e30a00 grant 
  39956230144 > available 39956226048 and pending 0

 I checked the availability of space and inodes, but this is not the 
 problem.

  The problem goes away after rebooting the OST.

  This is the second time this has happened: first in July 2010, then in 
  September 2010.

  Any ideas? Is it a bug?

 Thanks
 -- 
 Gianluca Tresoldi

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Virtual machines

2010-09-08 Thread Kevin Van Maren
I seem to recall Mellanox presenting a paper on IB support for virtual  
machines at SC two years ago.  I think it was just a proof of concept,  
and I'm unaware of the current status.

Kevin


On Sep 8, 2010, at 6:09 AM, Brian J. Murrell  
brian.murr...@oracle.com wrote:

 On Wed, 2010-09-08 at 05:50 -0500, Brian O'Connor wrote:
 Does lustre work in a VM?

 Yes, of course, given that a VM provides an entire virtual computer.

 what about in a VM over Infiniband?

  I don't know of any VMs which expose the host's Infiniband hardware for
 the VM to use directly.  Xen might.  libvirt/kvm might.  But those are
 just WAGs.

 b.

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Oss Error and 0 byte files

2010-09-08 Thread Kevin Van Maren

It might be related to bug 22755, but there the client gets ENOSPC


On Sep 8, 2010, at 8:02 AM, Gianluca Tresoldi gianluca.treso...@tuttogratis.com 
 wrote:



Hello everyone

I've an installation with Lustre 1.8.2, Centos 5, x86_64 and  I  
encountered this problem:


After several months of smooth operation, clients began to write  
empty files without any log errors; from their point of view the writes were  
successful.


OSS wrote, in their log, several lines like:
Sep  8 12:40:31 tgoss-0200 kernel: LustreError: 5816:0:(filter_io.c: 
183:filter_grant_space_left()) lfs01-OST: cli 20d94382-3300- 
f12e-65d1-c0f1743e1e20/8106a4e30a00 grant 39956230144 >  
available 39956226048 and pending 0


I checked the availability of space and inodes, but this is not the  
problem.


The problem goes away after rebooting the OST.

This is the second time this has happened: first in July 2010, then in September  
2010.


Any ideas? Is it a bug?

Thanks
--
Gianluca Tresoldi
***SysAdmin***
***Demon's Trainer***
Tuttogratis Italia Spa
E-mail: gianluca.treso...@tuttogratis.com
http://www.tuttogratis.it
Tel Centralino 02-57313101
Tel Diretto 02-57313136
Be open...
*** Confidentiality Notice & Disclaimer ***
This message, together with any attachments, is for the confidential
and exclusive use of the addressee(s). If you receive it in error,
please delete the message and its attachments from your system
immediately and notify us by return e-mail.
Do not disclose, copy, circulate or use any information contained in
this e-mail.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre requirements and tuning tricks

2010-09-08 Thread Kevin Van Maren
On Sep 8, 2010, at 8:25 AM, Joe Landman  
land...@scalableinformatics.com wrote:

 Joan J. Piles wrote:

 And then 2 MDS like these:

 - 2 x Intel 5520 (quad core) processor (or equivalent).
 - 36Gb RAM.
 - 2 x 64Gb SSD disks.
 - 2 x10Gb Ethernet ports.

 Hmmm 

In general there is not much gain from using SSD for MDT, and  
depending on the SSD, it could do much _worse_ than spinning rust.   
Many ssd controllers degrade horribly under the small random write  
workload.  (SSD are best for sequential write, random read).

Journals may receive some benefit, as the sequential write pattern  
works much better for SSDs, although SSDs are not normally needed there.



 After having read the documentation, it seems to be a sensible
 configuration, specially regarding the OSS. However we are not so  
 sure
 about the MDS. We have seen recommendations to reserve 5% of the  
 total
 file system space in the MDS. Is this true and then we should go for
  2x2Tb SAS disks for the MDS? Is SSD really worth it there?

 There is a nice formula for approximating your MDS needs on the wiki.
 Basically it is something to the effect of

Number-of-inodes-planned * 1kB = storage space required

 So, for 10 million inodes, you need ~10 GB of space.  I am not sure if
 this helps, but you might be able to estimate your likely usage
 scenario.  Updating MDSes isn't easy (e.g. you have to pre-plan)


It is 4KB/inode on the MDT.  (It can be set to 2KB if you need 4  
billion files on an 8TB MDT).

My sizing rule of thumb has been ~ one MDT drive in RAID10 for each  
OST, to ensure you scale IOPS.
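
As a worked example (numbers purely illustrative): planning for 100 million  
files at 4KB per inode means roughly 100e6 * 4KB = ~400GB of MDT space, before  
any safety margin.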



 And we have also read about having a separate storage for the OSTs'
 journals. Is it really useful to get a pair of extra small (16Gb) SSD
 disks for each OST to keep the journals and bitmaps?

It doesn't have to be SSD, and bitmaps are only applicable for  
software RAID.  But unless you use asynchronous journals, there is  
normally a big win from external journals -- even with HW RAID having  
non-volatile storage.  The big win is putting journals on raid 1,  
rather than raid5/6.
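
A minimal sketch of an external journal setup (device names illustrative,  
ideally with the journal device on a separate RAID1):

   # mke2fs -b 4096 -O journal_dev /dev/sdb
   # mkfs.lustre --ost --mgsnode=<mgs-nid> --mkfsoptions="-J device=/dev/sdb" /dev/sdc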



 Finally, we have also read that it's important to have different  
 OSTs in
 different physical drives to avoid bottlenecks. Is thas so if we  
 make a
 big RAID volume and then several logical volumes (done with the  
 hardware
 raid card, the operating system would just see different block  
 devices)?

 Yes, though this will be suboptimal in performance.  You want  
 traffic to
 different LUNs not sharing the same physical disks.  Build smaller  
 RAID
 containers, and single LUNs atop those.

You get best performance with one HW RAID per OST.  And that RAID  
should be optimized for 1MB IO (ie, not 6+p) for best performance  
without having to muck with a bunch of parameters.  If the OSTs are on  
the same drives, then there will be excessive head contention as  
different OST filesystems seek the same disks, greatly reducing  
throughput.
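
As a worked example (geometry illustrative): an 8+2 RAID6 with a 128KB chunk  
gives a 1MB full data stripe, and the matching ldiskfs options would be

   --mkfsoptions="-E stride=32,stripe-width=256"

(stride = 128KB chunk / 4KB block = 32; stripe-width = 8 data disks * 32 = 256).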


 -- 
 Joseph Landman, Ph.D
 Founder and CEO
 Scalable Informatics Inc.
 email: land...@scalableinformatics.com
 web  : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
 phone: +1 734 786 8423 x121
 fax  : +1 866 888 3112
 cell : +1 734 612 4615
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Announce: Lustre 2.0.0 is available!

2010-08-26 Thread Kevin Van Maren
Yes


On Aug 27, 2010, at 7:04 AM, Mike Hanby mha...@uab.edu wrote:

 Are the release notes accurate in that OFED 1.5.1 is not supported  
 in Lustre 2.0.0 but it is supported in 1.8.4?

 -Original Message-
 From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss- 
 boun...@lists.lustre.org] On Behalf Of Terry Rutledge
 Sent: Thursday, August 26, 2010 12:55 PM
 Subject: [Lustre-discuss] Announce: Lustre 2.0.0 is available!

  Hi all,

 The entire Lustre team is pleased to announce the GA Release of Lustre
 2.0.0.
 This represents the first release of the main Lustre trunk in a number
 of years.
 The team has spent extraordinary efforts over the last year  
 preparing this
 release for GA. This release has had the most extensive pre-release  
 testing
 of any previous Lustre release.

 We are excited for the community to try this release and offer  
 feedback.

 Our next 2.x release is planned for later this year and details will  
 follow
 at a later date.

 Quick Reference:
 Lustre 2.0.0 is available on the Oracle Download Center Site.
 http://www.oracle.com/technetwork/indexes/downloads/sun-az-index-095901.html#L

 The Lustre 2.0 Operations Manual:
 http://dlc.sun.com/pdf/821-2076-10/821-2076-10.pdf

 The Release Notes:
 http://dlc.sun.com/pdf/821-2077-10/821-2077-10.pdf

 The change log:
 http://wiki.lustre.org/index.php/Change_Log_2.0

 As always, you can report issues via Bugzilla:
 https://bugzilla.lustre.org/

 To access earlier releases of Lustre, please check the box
 See previous products(P), then click L or scroll down to
 Lustre, the current and all previous releases (1.8.0 - 1.8.4)
 will be displayed.

 Happy downloading!

 -- The Lustre Team --


 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Enabling async journals while the filesystem is active

2010-08-20 Thread Kevin Van Maren
Yes, but depending on the Lustre version there are several bugs in the 
async journal code.

Kevin


Erik Froese wrote:
 Is it safe to enable async journals on the OSS's while the filesystem is 
 active?
 I'd like to see how it works for us.

 Thanks
 Erik
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] More detail regarding soft lockup error

2010-08-19 Thread Kevin Van Maren
Andreas _always_ recommends a backup first.

Kevin


Brian J. Murrell wrote:
 On Thu, 2010-08-19 at 10:09 -0600, Andreas Dilger wrote: 
   
 If you increase the size of the MDT (via resize2fs) it will increase the 
 number of inodes as well.
 

 Andreas: what is [y]our confidence level with resize2fs and our MDT?
 Given that I don't think we regularly (if at all) test this in our QA
 cycles (although I wish we would) I personally would be a lot more
 comfortable with a backup first.  What are your thoughts?  Unnecessary?

 b.
   
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Splitting lustre space

2010-08-18 Thread Kevin Van Maren
David Noriega wrote:
 OK hooray! Lustre setup with failover of all nodes, but now we have
 this huge lustre mount point. How can I say create /lustre/home and
 /lustre/groups and mount on the client?

 David
   

Two choices:

1) create two Lustre file systems (separate MDT and OSTs for each)
2) use mount --bind on the client to make one filesystem's directories 
show up in different places

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Splitting lustre space

2010-08-18 Thread Kevin Van Maren
David Noriega wrote:
 Ok, so I could do
 mount --bind /lustre/home /home
 mount --bind /lustre/groups /groups

 Is this a generally accepted practice with Lustre? This just seems so
 much like a nifty trick, but if its what the community uses, then ok.
   
It is a pretty nifty trick.  Same file system, so the same quotas (if 
any) would apply to both
directories.
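
A sketch of doing the same from /etc/fstab on the clients (paths assume the  
Lustre client mount is /lustre and is mounted first):

   /lustre/home    /home    none   bind   0 0
   /lustre/groups  /groups  none   bind   0 0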

 But ultimately if I wanted two separate filesystems, I would need more
 hardware? An OST can't be put into a general 'pool' for use between
 the two?
   

You probably don't need more hardware, but you would have to decide 
which file system
each OST would serve -- it can only provide space to one file system.  
So some of your
OSTs would be for home and some for groups.  You would need to have 2 
MDTs (if
necessary, you could split/partition the MDT you have).

Kevin

 David

 On Wed, Aug 18, 2010 at 12:33 PM, Kevin Van Maren
 kevin.van.ma...@oracle.com wrote:
   
 David Noriega wrote:
 
 OK hooray! Lustre setup with failover of all nodes, but now we have
 this huge lustre mount point. How can I say create /lustre/home and
 /lustre/groups and mount on the client?

 David

   
 Two choices:

 1) create two Lustre file systems (separate MDT and OSTs for each)
 2) use mount --bind on the client to make one filesystem's directories
 show up in different places


 



   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Question on setting up fail-over

2010-08-16 Thread Kevin Van Maren
David Noriega wrote:
 Ok I've gotten heartbeat setup with the two OSSs, but I do have a
 question that isn't stated in the documentation. Shouldn't the lustre
 mounts be removed from fstab once they are given to heartbeat since
 when it comes online, it will mount the resources, correct?

 David
   


Yes: on the servers, the entries must either not be there or be marked noauto.  Once you start 
running heartbeat,
you have given control of the resource away, and must not mount/umount 
it yourself
(unless you stop heartbeat on both nodes in the HA pair to get control 
back).
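
If you keep the entries in fstab at all, they would look something like this  
(device and mount point illustrative), so nothing mounts them at boot and only  
heartbeat does:

   /dev/mapper/ost0   /mnt/lustre/ost0   lustre   noauto   0 0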

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ost pools

2010-08-10 Thread Kevin Van Maren
That's more likely if you file a bug report at bugzilla.lustre.org.
Even better if you modify check_and_complete_ostname in lctl
to handle your OST names and submit a patch with the bug.

Kevin


Stu Midgley wrote:
  Right, so I assume this means it will be fixed in some future version
  of lustre, and until then I can't have those nodes in the pool?


 On Tue, Aug 10, 2010 at 3:41 PM, Andreas Dilger
 andreas.dil...@oracle.com wrote:
   
 On 2010-08-10, at 01:20, Stu Midgley wrote:
 
 # lctl pool_add l1.default l1-OST[10]
 OST l1-OST0010_UUID is not part of the 'l1' fs.
 pool_add: No such file or directory


 All the nodes that have the new-style names went into the pool just
 fine.  all the nodes with old-style names will not go into the pool.

 eg. ost_011_UUID
   
 I had a quick look at lctl::jt_pool_cmd(), and it looks like this checking 
 is done in userspace in check_and_complete_ostname(), to avoid bad 
 interactions with invalid OST names, and to allow short forms of the OST 
 to be used (e.g. OST0001 instead of l1-OST0001_UUID).

 That said, it should also be possible to have lctl scan the existing OST 
 UUID array via setup_obd_indexes(param-obd_uuid = ost_name) to see if the 
 OST name is actually valid before adding it to the pool.  That will iterate 
 over the list of OSTs, and use llapi_uuid_match() to see if the OST name is 
 valid.

 
 We have a lustre file system which started life at V1.4 and is now at V1.8.
  I'm keen to use ost pools, but I can't actually add nodes to the pool.  
 The node names are not in a format that lctl pool_add likes

  ost_011_UUID    3.3T   3.0T   331.5G   90%   /l1[OST:10]

 lctl pool_add l1.default OST[10]
 OST l1-OST0010_UUID is not part of the 'l1' fs.
 pool_add: No such file or directory

 How do I get nodes with these names added to a pool?

 Thanks.
 
 --
 Dr Stuart Midgley
 sdm...@gmail.com
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   
 Cheers, Andreas
 --
 Andreas Dilger
 Lustre Technical Lead
 Oracle Corporation Canada Inc.


 



   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Question on setting up fail-over

2010-08-09 Thread Kevin Van Maren
On Aug 9, 2010, at 11:45 AM, David Noriega tsk...@my.utsa.edu wrote:

 My understanding of setting up fail-over is you need some control over
 the power so with a script it can turn off a machine by cutting its
 power? Is this correct?

It is the recommended configuration because it is simple to understand  
and implement.

But the only _hard_ requirement is that both nodes can access the  
storage.


 Is there a way to do fail-over without having
 access to the pdu(power strips)?

If you have IPMI support, that can be used for power control, instead  
of a switched PDU.  Depending on the storage, you may be able to do  
resource fencing of the disks instead of STONITH.  Or you can run  
fast-and-loose, without any way to ensure the dead node is really dead  
and not accessing storage (at your risk).  While Lustre has MMP, it is  
really more to protect against a mount typo than to guarantee resource  
fencing.


 Thanks
 David

 -- 
 Personally, I liked the university. They gave us money and facilities,
 we didn't have to produce anything! You've never been out of college!
 You don't know what it's like out there! I've worked in the private
 sector. They expect results. -Ray Ghostbusters
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Multiple FS in one MDS

2010-08-04 Thread Kevin Van Maren
If the question is whether you can have multiple file systems on the 
same servers, the answer is yes.
1) Need a new LUN/partition for the MDT
2) Need new LUN/partitions for the OST to provide space for that file 
system, as an OST belongs to exactly one Lustre file system

It looks like you added a second MDT, but did not add any OSTs for the 
NGS file system?
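
A sketch of registering an OST for the new file system (MGS NID and device  
illustrative):

   oss# mkfs.lustre --fsname=NGS --ost --mgsnode=<mgs-nid> /dev/<new-lun>
   oss# mount -t lustre /dev/<new-lun> /mnt/NGS-ost0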

If the question is whether you can take one file system and mount parts 
of it at different places on the client, then that answer is also yes: 
look at mount --bind which can make a file system (or subdir) appear 
at a different location.

Kevin


Fabio Cassarotti Parronchi Navarro wrote:
 Hi,

 We recently started using Lustre in a production environment for a small 
 storage setup, and we are currently testing speed and reliability. So far, 
 everyone is excited about it.
 But here comes the problem. Is it possible to create another mount 
 point on the same server that is already running a MDS ?

 For example:
   [mdsho...@tcp:/Projects ( current )
   [mdsho...@tcp:/NGS  ( new )

 Actually, I've been able to create another partition on the MDS using 
 ( --fsname=NGS ) and mount it, and the OSSs seem to be running nicely too 
 ( at least no problems are reported in the log files ). But when I try 
 to mount the NGS file system on the clients, the mount command freezes 
 with no output on the logs

 mount -t lustre [mdsho...@tcp:/NGS /home/NGS/

 The MDS logs:

 Aug  4 08:32:02 pnq1 kernel: LustreError: 
 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
 (-11)  r...@810112097800 x1338817395530034/t0 o38-?@?:0/0 lens 
 368/0 e 0 to 0 dl 1280921622 ref 1 fl Interpret:/0/0 rc -11/0
 Aug  4 08:32:24 pnq1 kernel: LustreError: 
 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
 (-11)  r...@810226542000 x1338817395530041/t0 o38-?@?:0/0 lens 
 368/0 e 0 to 0 dl 1280921644 ref 1 fl Interpret:/0/0 rc -11/0
 Aug  4 08:32:24 pnq1 kernel: LustreError: 
 19304:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 1 previous 
 similar message
 Aug  4 08:33:01 pnq1 kernel: LustreError: 
 19297:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
 (-11)  r...@8101fc298c00 x1336109251604116/t0 o38-?@?:0/0 lens 
 368/0 e 0 to 0 dl 1280921681 ref 1 fl Interpret:/0/0 rc -11/0
 Aug  4 08:33:01 pnq1 kernel: LustreError: 
 19297:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 4 previous 
 similar messages
 Aug  4 08:33:07 pnq1 kernel: Lustre: NGS-MDT: temporarily refusing 
 client connection from 192.168.10...@tcp

 Do I have to change any config on the MDS to fix this issue? Or this 
 architecture is not supported by Lustre ?

 Thanks in advance,
 Fábio Navarro

 -- 
 Ludwig Insitute for Cancer Research LTDA
 Laboratory of Computational Biology
 245 João Julião St - 1th floor
 CEP 01323-903 - Sao Paulo - Brazil
 Phone: 55 11 33883232
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Client directory entry caching

2010-08-03 Thread Kevin Van Maren
Since Bug 22492 hit a lot of people, it sounds like opencache isn't  
generally useful unless enabled on every node. Is there an easy way to  
force files out of the cache (ie, echo 3 > /proc/sys/vm/drop_caches)?


Kevin


On Aug 3, 2010, at 11:50 AM, Oleg Drokin oleg.dro...@oracle.com wrote:


Hello!

On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote:
So even with the metadata going over NFS the opencache in the client
seems to make quite a difference (I'm not sure how much the NFS client
caches though). As expected I see no mdt activity for the NFS export
once cached. I think it would be really nice to be able to enable the
opencache on any Lustre client. A couple of potential workloads that I
A simple workaround for you to enable opencache on a specific client would
be to add cr_flags |= MDS_OPEN_LOCK; in mdc/mdc_lib.c:mds_pack_open_flags()

Yea that works - cheers. FYI some comparisons with a simple find on a
remote client (~33,000 files):

find /mnt/lustre (not cached) = 41 secs
find /mnt/lustre (cached) = 19 secs
find /mnt/lustre (opencache) = 3 secs


Hm, initially I was going to say that find is not open-intensive so it
should not benefit from opencache at all.
But then I realized if you have a lot of dirs, then indeed there would be a
positive impact on subsequent reruns.
I assume that the opencache result is a second run, and the first run produces
the same 41 seconds?

BTW, another unintended side-effect you might experience if you have a mixed
opencache-enabled/disabled network: if you run something (or open a file for
write) on an opencache-enabled client, you might have problems writing (or
executing) that file from non-opencache-enabled nodes as long as the file
handle remains cached on the client. This is because if the open lock was not
requested, we don't try to invalidate current ones (expensive), and the MDS
would think the file is genuinely open for write/execution and disallow
conflicting accesses with EBUSY.


performance when compared to something simpler like NFS. Slightly off
topic (and I've kinda asked this before) but is there a good reason
why link() speeds in Lustre are so slow compared to something like NFS?
A quick comparison of doing a cp -al from a remote Lustre client and
an NFS client (to a fast NFS server):

cp -fa /mnt/lustre/blah /mnt/lustre/blah2 = ~362 files/sec
cp -fa /mnt/nfs/blah /mnt/nfs/blah2 = ~1863 files/sec

Is it just the extra depth of the Lustre stack/code path? Is there
anything we could do to speed this up if we know that no other client
will touch these dirs while we hardlink them?


Hm, this is the first complaint about this that I have heard.
I just looked into an strace of cp -fal (which I guess you meant instead of
just -fa, which would just copy everything).

so we traverse the tree down creating a dir structure in parallel first (or
just doing it in readdir order)

open("/mnt/lustre/a/b/c/d/e/f", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
+1 RPC

fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
+1 RPC (if no opencache)

fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
getdents(3, /* 4 entries */, 4096)  = 96
getdents(3, /* 0 entries */, 4096)  = 0
+1 RPC

close(3) = 0
+1 RPC (if no opencache)

lstat("/mnt/lustre/a/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(should be cached, so no RPC)

mkdir("/mnt/lustre/blah2/b/c/d/e/f/g", 040755) = 0
+1 RPC

lstat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
+1 RPC

stat("/mnt/lustre/blah2/b/c/d/e/f/g", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(should be cached, so no RPC)

Then we get to files:
link("/mnt/lustre/a/b/c/d/e/f/g/k/8", "/mnt/lustre/blah2/b/c/d/e/f/g/k/8") = 0
+1 RPC

futimesat(AT_FDCWD, "/mnt/lustre/blah2/b/c/d/e/f/g/k", {{1280856246, 0}, {1280856291, 0}}) = 0
+1 RPC

then we start traversing the just created tree up and chowning it:
chown("/mnt/lustre/blah2/b/c/d/e/f/g/k", 0, 0) = 0
+1 RPC

getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_access", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
+1 RPC

stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
(not sure why another stat here, we already did it on the way up. Should be cached)

setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_access", "\x02\x00\x00\x00\x01\x00\x07\x00\xff\xff\xff\xff\x04\x00\x05\x00\xff\xff\xff\xff \x00\x05\x00\xff\xff\xff\xff", 28, 0) = 0
+1 RPC

getxattr("/mnt/lustre/a/b/c/d/e/f/g/k", "system.posix_acl_default", 0x7fff519f0950, 132) = -1 ENODATA (No data available)
+1 RPC

stat("/mnt/lustre/a/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
Hm, stat again? did not we do it a few syscalls back?

stat("/mnt/lustre/blah2/b/c/d/e/f/g/k", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat of the target. +1 RPC (the cache got invalidated by link above).

setxattr("/mnt/lustre/blah2/b/c/d/e/f/g/k", "system.posix_acl_default", "\x02\x00\x00\x00", 4, 

Re: [Lustre-discuss] Per directory quota

2010-07-16 Thread Kevin Van Maren
On Jul 16, 2010, at 7:17 AM, Christopher J.Walker c.j.wal...@qmul.ac.uk 
  wrote:

 I know Lustre can do quotas per user, but can Lustre do quotas on a
 per-directory basis?

No, Lustre does not support directory (fileset) based quotas.
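
(What is available is per-user / per-group accounting -- a minimal sketch,
assuming quotas have already been enabled on the file system; the user and
group names and mount point are made up:)

lfs quota -u someuser /mnt/lustre
lfs quota -g somegroup /mnt/lustre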


 I can't work out how to do this from the manual.

 To be more specific, the software we use[1] is written by people using
 GPFS and we'd like an equivalent to the GPFS command:

mmlsquota -j

 which AIUI finds out how much space is used under a directory.

 We could use

du --summarize mydirectory

 but for a directory containing a large number of files, this takes a
 long time - and is presumably not very efficient. If there were an lfs du
 it would presumably be more efficient, but even so, probably still
 resource intensive.

Without size-on-MDS, either way would have to query both the MDS and
each OST to get the size info.  Not being that familiar with size-on-MDS,
it does seem likely that du would still have to query the OSTs for size
info, even when ls -l does not.


 Am I missing something?

 [1] http://storm.forge.cnaf.infn.it/home


 Chris

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS Export Issues

2010-07-16 Thread Kevin Van Maren
Without more information about the server error messages and the exact NFS 
configuration, I'm not sure anyone can help more than this.  A common 
problem with Lustre NFS exports, one that isn't due to normal 
NFS/configuration issues, is getting error -43 when the MDS does not have 
the client's user IDs in its /etc/passwd and /etc/group files.

Dumb question, but have you checked the permissions on the NFS server's 
Lustre mount point (before/after Lustre is mounted), and exported a 
non-Lustre directory successfully?
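
Something along these lines is usually enough to rule out the basics (the
paths, export options and numeric ids below are only examples, not your
actual config):

# on the NFS server:
ls -ld /mnt/lustre                    # permissions with and without Lustre mounted
cat /etc/exports                      # e.g. /mnt/lustre *(rw,sync,no_root_squash)
exportfs -rav                         # re-export and show what is actually exported
# on the MDS, for the -43 case:
getent passwd 500 ; getent group 500  # do the client's uid/gid resolve there?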

Kevin


Andreas Dilger wrote:
 My only other suggestion is to dump the Lustre kernel debug log on the NFS 
 server after a mount failure to see where/why it is getting the permission 
 error. 

 # lctl clear
 # (mount NFS client)
 # lctl dk /tmp/debug

 Then search through the logs for -2 errors (-EPERM).

 Cheers, Andreas

 On 2010-07-16, at 10:06, William Olson lustre_ad...@reachone.com wrote:

   
 On 7/15/2010 5:48 PM, Andreas Dilger wrote:
 
 On 2010-07-15, at 08:33, William Olson wrote:
   
   
 Somebody, anybody?  I'm sure it's something fairly simple, but it
 escapes me, assistance would be greatly appreciated!
 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Luster 1.8.3 Qlogic OFED 1.4.2

2010-07-16 Thread Kevin Van Maren
If you replace OFED, you do not need to rebuild the kernel (unless you 
want to patch/change it), so you can install the binary Lustre kernel.  
You do need to rebuild Lustre (or at least the kernel modules) (step 
#3), as o2ib must be built against the OFED you are running.
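
Roughly (a sketch only -- the OFED header location varies with how the
QLogic package installs itself, so /usr/src/ofa_kernel is an assumption,
as is the exact kernel source path):

cd lustre-1.8.3
./configure --with-linux=/usr/src/kernels/2.6.18-164.11.1.el5_lustre.1.8.3 \
            --with-o2ib=/usr/src/ofa_kernel
make rpms    # then install the resulting lustre/lustre-modules RPMs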

Kevin


Marco Aurelio L Gomes wrote:
 Hi,

 I saw the post above on the lustre-discuss list and would like to know
 whether, if I install OFED 1.5 with Lustre 1.8.3, I'll need to build the
 Lustre kernel, or whether I can just install the available kernel from the
 Lustre download page. When I saw that the build in (1) is optional, I
 thought it would be possible to use that available kernel.

 Thanks in advance.

 Best regards,

 Marco Gomes
 Systems/HPC-Cluster
 Numerical Offshore Tank
 Naval and Ocean Engineering Department's Laboratory
 Escola Politécnica
 University of São Paulo
 +55 11 3777 4142 ext. 250

 On Wed, 2010-06-09 at 12:48 -0700, Kevin Van Maren wrote:
   
 Looks like a mis-match on the OFED modules Lustre is expecting.

 If not using the included OFED, you need these steps:

 1) Build (optional) and install the Lustre kernel
 2) Build OFED against the Lustre kernel and install it
 3) Build Lustre against the kernel and the OFED you are using

 Lustre defaults to using the in-kernel OFED unless you point configure
 at a different set of headers.

 Kevin


 - Original Message -
 From: srirangam.addepa...@gmail.com
 To: lustre-discuss@lists.lustre.org
 Sent: Wednesday, June 9, 2010 1:33:04 PM GMT -07:00 US/Canada Mountain
 Subject: [Lustre-discuss] Luster 1.8.3 Qlogic OFED 1.4.2

 Hello All,
 I am trying to use Lustre with QLogic OFED 1.4.2. After building and
 installing the kernel, when I try modprobe lustre
 I get the following errors:


 # modprobe lustre
 WARNING: Error inserting osc
 (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/osc.ko): 
 Input/output error
 WARNING: Error inserting mdc
 (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/mdc.ko): 
 Input/output error
 WARNING: Error inserting lov
 (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/lov.ko): 
 Input/output error
 FATAL: Error inserting lustre
 (/lib/modules/2.6.18-164.11.1.el5_lustre.1.8.3/kernel/fs/lustre/lustre.ko): 
 Input/output error


 dmesg shows errors of the type

 Lustre: OBD class driver, http://www.lustre.org/
 Lustre: Lustre Version: 1.8.3
 Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
 ko2iblnd: disagrees about version of symbol ib_fmr_pool_unmap
 ko2iblnd: Unknown symbol ib_fmr_pool_unmap
 ko2iblnd: disagrees about version of symbol ib_create_cq
 ko2iblnd: Unknown symbol ib_create_cq
 ko2iblnd: disagrees about version of symbol rdma_resolve_addr
 ko2iblnd: Unknown symbol rdma_resolve_addr

 Following are the RPMs installed:

 # rpm -qa | grep kernel
 kernel-doc-2.6.18-164.6.1.el5
 kernel-2.6.18-164.6.1.el5
 kernel-headers-2.6.18-164.6.1.el5
 kernel-ib-devel-1.4.2.1-2.6.18_164.11.1.el5_lustre.1.8.3
 kernel-2.6.18-164.11.1.el5_lustre.1.8.3
 kernel-ib-1.4.2.1-2.6.18_164.11.1.el5_lustre.1.8.3
 kernel-devel-2.6.18-164.6.1.el5
 kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.3

 What am i missing.

 Ady

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 

   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS Export Issues

2010-07-16 Thread Kevin Van Maren
Looks like a problem with your mount point.  What are the permissions  
on the client directory?


On Jul 16, 2010, at 6:23 PM, William Olson lustre_ad...@reachone.com  
wrote:

 On 7/16/2010 5:12 PM, Andreas Dilger wrote:

 Well that improved the debug level, but didn't reveal any -2  
 errors..   In fact I can't seem to find a line with an error in  
 it... Is there a specific verbiage used on error lines that I can  
 grep for?  90% is Process entered or Process leaving...

 You could try strace -f on the mount process, to see which  
 syscall is failing.  It may be failing with something before it  
 gets to Lustre.

 Results of strace below:

 [r...@lustreclient mnt]# strace -f -p 15964
 Process 15964 attached - interrupt to quit



 lstat("/mnt", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 lstat("/mnt/lustre_mail_fs", 0x7fff4bd4b2b0) = -1 EACCES (Permission denied)
 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2875, ...}) = 0


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] NFS Export Issues

2010-07-16 Thread Kevin Van Maren
But the client is doing an lstat on /mnt/lustre_mail_fs, not /mnt/lustre --
what is the mount command again?


On Jul 16, 2010, at 6:50 PM, William Olson lustre_ad...@reachone.com  
wrote:

 On 7/16/2010 5:41 PM, Kevin Van Maren wrote:
 Looks like a problem with your mount point.  What are the  
 permissions on the client directory?

 NFSServer/Lustre Client

 Lustre mounted:
 drwxrwxrwx 29 root root 4.0K Jul 12 17:03 lustre_mail_fs
 Lustre not mounted:
 drwxrwxrwx 2 root root 4.0K Jun 10 13:26 lustre_mail_fs

 NFSClient mount dir:
 drwxrwxrwx  2 root root 4.0K Jul 12 15:09 lustre


 On Jul 16, 2010, at 6:23 PM, William Olson  
 lustre_ad...@reachone.com wrote:

 On 7/16/2010 5:12 PM, Andreas Dilger wrote:

 Well that improved the debug level, but didn't reveal any -2  
 errors..   In fact I can't seem to find a line with an error in  
 it... Is there a specific verbiage used on error lines that I  
 can grep for?  90% is Process entered or Process leaving...

 You could try strace -f on the mount process, to see which  
 syscall is failing.  It may be failing with something before it  
 gets to Lustre.

 Results of strace below:

 [r...@lustreclient mnt]# strace -f -p 15964
 Process 15964 attached - interrupt to quit



 lstat("/mnt", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
 lstat("/mnt/lustre_mail_fs", 0x7fff4bd4b2b0) = -1 EACCES (Permission denied)
 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2875, ...}) = 0



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

