[Lustre-discuss] Need help on lustre filesystem setup..
Hi, I am trying to run the Apache Hadoop project on a parallel filesystem like Lustre. I have 1 MDS, 2 OSS/OST nodes and 1 Lustre client. My Lustre client shows:

Code:
[root@lustreclient1 ~]# lfs df -h
UUID                 bytes   Used    Available  Use%  Mounted on
lustre-MDT0000_UUID  4.5G    274.3M  3.9G       6%    /mnt/lustre[MDT:0]
lustre-OST0000_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:0]
lustre-OST0001_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:1]
lustre-OST0002_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:2]
lustre-OST0003_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:3]
lustre-OST0004_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:4]
lustre-OST0005_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:5]
lustre-OST0006_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:6]
lustre-OST0007_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:7]
lustre-OST0008_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:8]
lustre-OST0009_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:9]
lustre-OST000a_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:10]
lustre-OST000b_UUID  5.9G    276.1M  5.3G       5%    /mnt/lustre[OST:11]
filesystem summary:  70.9G   3.2G    64.0G      5%    /mnt/lustre

As I was unsure which machine I needed to install the Hadoop software on, I went ahead and installed Hadoop on LustreClient1. I configured LustreClient1 with JAVA_HOME and the Hadoop parameters, with the following entries:

File: conf/core-site.xml
Code:
<property>
  <name>fs.default.name</name>
  <value>file:///mnt/lustre</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>${fs.default.name}/hadoop_tmp/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

I didn't make any changes in mapred-site.xml. Now when I start 'bin/start-mapred.sh', it tries to ssh to my own local machine. I am not sure if I am doing this right.

Doubt: Do I need to have two Lustre clients for this to work?
Then I tried running the wordcount program shown below:

Code:
bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /tmp/rahul /tmp/rahul/rahul-output
[...] Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:29 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:30 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:31 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:32 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

Question 1: As I have been comparing HDFS and Lustre for Hadoop, what would be the right number of hardware nodes to compare? Say I have 1 MDS, 2 OSS and 1 Lustre client on one side, and 1 NameNode and 3 DataNodes on the other. How can I compare both filesystems?
Question 2: Do I really need 2 Lustre clients to set up Hadoop over Lustre? If it is possible, how can I use the OSS and MDS for the Hadoop setup too?
Question 3: As I read in the wordcount example, we need to insert data into the HDFS filesystem. Do we need to do the same for Lustre too? What's the command?
Question 4: What are the steps to confirm that Hadoop is actually using the Lustre FS?

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
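For readers following along: since fs.default.name above is file:///mnt/lustre, Hadoop treats Lustre as an ordinary POSIX filesystem, so staging input data is a plain copy rather than an hdfs dfs -put, and lfs getstripe shows whether a file really has objects on the OSTs. A minimal sketch (the paths are hypothetical placeholders, not from the thread; the commands are echoed rather than executed so the sketch is safe to read through):

```shell
#!/bin/sh
# Sketch: staging wordcount input on a file:// (Lustre) default filesystem,
# then checking the file's striping. All paths are hypothetical examples.
CMDS=""
run() { echo "+ $*"; CMDS="$CMDS; $*"; }   # echo commands instead of running them

run cp -r /home/rahul/books /mnt/lustre/tmp/rahul      # plain POSIX copy, no hdfs dfs -put
run lfs getstripe /mnt/lustre/tmp/rahul/books          # lists OST objects if the data is on Lustre
```

The same lfs getstripe check on a job's output directory is one way to answer Question 4.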
Re: [Lustre-discuss] Need help on lustre..
I tried running the command below but got the following error. I have not put the data into HDFS, since Lustre is what I am trying to use.

[code]
#bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /user/hadoop/hadoop /user/hadoop-output
13/02/17 17:02:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/17 17:02:50 INFO input.FileInputFormat: Total input paths to process : 1
13/02/17 17:02:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/17 17:02:50 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004
13/02/17 17:02:50 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3731)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3695)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
Caused by: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3729)
        ... 12 more
org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        [same server-side stack trace as above repeated here]
        ... 12 more
        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        [message truncated]
[/code]
Re: [Lustre-discuss] Need help on lustre..
Great!!! I tried removing the entry from mapred-site.xml and it seems to run well. Here are the logs now:

[code]
[root@alpha hadoop]# bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /user/hadoop/hadoop/ /user/hadoop/hadoop/output
13/02/17 17:14:37 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/17 17:14:38 INFO input.FileInputFormat: Total input paths to process : 1
13/02/17 17:14:38 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/17 17:14:38 INFO mapred.JobClient: Running job: job_local_0001
13/02/17 17:14:38 INFO util.ProcessTree: setsid exited with exit code 0
13/02/17 17:14:38 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2f74219d
13/02/17 17:14:38 INFO mapred.MapTask: io.sort.mb = 100
13/02/17 17:14:38 INFO mapred.MapTask: data buffer = 79691776/99614720
13/02/17 17:14:38 INFO mapred.MapTask: record buffer = 262144/327680
13/02/17 17:14:38 INFO mapred.MapTask: Starting flush of map output
13/02/17 17:14:39 INFO mapred.JobClient: map 0% reduce 0%
13/02/17 17:14:39 INFO mapred.MapTask: Finished spill 0
13/02/17 17:14:39 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/02/17 17:14:39 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6d79953c
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Merger: Merging 1 sorted segments
13/02/17 17:14:39 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79496 bytes
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/02/17 17:14:39 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /user/hadoop/hadoop/output
13/02/17 17:14:39 INFO mapred.LocalJobRunner: reduce > reduce
13/02/17 17:14:39 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/02/17 17:14:40 INFO mapred.JobClient: map 100% reduce 100%
13/02/17 17:14:40 INFO mapred.JobClient: Job complete: job_local_0001
13/02/17 17:14:40 INFO mapred.JobClient: Counters: 20
13/02/17 17:14:40 INFO mapred.JobClient:   File Output Format Counters
13/02/17 17:14:40 INFO mapred.JobClient:     Bytes Written=57885
13/02/17 17:14:40 INFO mapred.JobClient:   FileSystemCounters
13/02/17 17:14:40 INFO mapred.JobClient:     FILE_BYTES_READ=643420
13/02/17 17:14:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=574349
13/02/17 17:14:40 INFO mapred.JobClient:   File Input Format Counters
13/02/17 17:14:40 INFO mapred.JobClient:     Bytes Read=139351
13/02/17 17:14:40 INFO mapred.JobClient:   Map-Reduce Framework
13/02/17 17:14:40 INFO mapred.JobClient:     Map output materialized bytes=79500
13/02/17 17:14:40 INFO mapred.JobClient:     Map input records=2932
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/02/17 17:14:40 INFO mapred.JobClient:     Spilled Records=11180
13/02/17 17:14:40 INFO mapred.JobClient:     Map output bytes=212823
13/02/17 17:14:40 INFO mapred.JobClient:     Total committed heap usage (bytes)=500432896
13/02/17 17:14:40 INFO mapred.JobClient:     CPU time spent (ms)=0
13/02/17 17:14:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=99
13/02/17 17:14:40 INFO mapred.JobClient:     Combine input records=21582
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce input records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce input groups=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Combine output records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce output records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/02/17 17:14:40 INFO mapred.JobClient:     Map output records=21582
[root@alpha hadoop]#
[/code]

Does it mean Hadoop over Lustre is working fine?

On 2/17/13, linux freaker linuxfrea...@gmail.com wrote: [earlier message quoted]
Re: [Lustre-discuss] Need help on lustre..
Hi, I think you might be better served with your Hadoop setup by posting to the Hadoop discussion list. Once you have it set up and working, if you run into Lustre-related issues, please feel free to post those here. Good luck! -cf

On 02/17/2013 04:47 AM, linux freaker wrote: [earlier message and logs quoted]
Re: [Lustre-discuss] Need Help
Hi, Additional logging from the MDS and OSSs is required to really tell what's going on. That said, you can try to verify that your OSS nodes can successfully contact your MDS and MGS nodes; lctl ping will indicate this. After that, if you find they are successfully contacting each other, you can try aborting recovery on both the MDT and the OSTs you're attempting to mount (the -o abort_recov mount option). -cf

On 01/09/2012 04:00 AM, Patrice Hamelin wrote: [earlier message quoted]
On 01/07/12 07:19, Ashok nulguda wrote: [earlier message quoted]
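The suggestion above can be sketched as follows; the NID and the device/mount paths are hypothetical placeholders, and the commands are echoed rather than executed so the sketch is safe to copy from:

```shell
#!/bin/sh
# Sketch of the advice above: check OSS -> MGS/MDS connectivity with lctl ping,
# then mount the OST with recovery aborted. NID and paths are placeholders.
CMDS=""
run() { echo "+ $*"; CMDS="$CMDS; $*"; }   # echo commands instead of running them

run lctl ping 192.168.1.10@tcp                                  # MGS/MDS NID; succeeds only if LNET can reach it
run mount -t lustre -o abort_recov /dev/mapper/ost1 /mnt/ost1   # skip waiting for client recovery
```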
Re: [Lustre-discuss] Need Help
Hi, I am getting that occasionally; trying the remount a second time works. I am interested in finding out what's happening too. Thanks.

On 01/07/12 07:19, Ashok nulguda wrote: [earlier message quoted]

--
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada
[Lustre-discuss] Need Help
Dear All, We have Lustre 1.8.4 installed with 2 MDS servers and 2 OSS servers, with 17 OSTs and 1 MDT, and HA configured on both my MDS and OSS.

Problem: Some of my OSTs are not mounting on my OSS servers. When I try to mount them manually, the mount throws errors:

command: mount -t lustre /dev/mapper/.. /OST1
failed: Transport endpoint is not connected

However, when we log in and check the MDS server for the Lustre OST status, we find that

cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status

shows "completed", and

cat /proc/fs/lustre/devices

shows all my MDT and OSTs with "up" status. Can anyone help us debug it?

Thanks and Regards
Ashok

--
Ashok Nulguda
TATA ELXSI LTD
Mb : +91 9689945767
Mb : +91 9637095767
Land line : 2702044871
Email : ash...@tataelxsi.co.in
Re: [Lustre-discuss] Need Help
How are your OSTs connected to your OSSs? -cf -Original message- From: Ashok nulguda ashok0...@gmail.com To: Lustre Discussion list Lustre-discuss@lists.lustre.org Sent: Sat, Jan 7, 2012 00:19:59 MST Subject: [Lustre-discuss] Need Help
[Lustre-discuss] Need help
Hi, I just upgraded our servers from RHEL 5.4 to RHEL 5.5 and went from Lustre 1.8.3 to 1.8.5. Now when I try to mount the OSTs I'm getting:

[root@aoss1 ~]# mount -t lustre /dev/disk/by-label/scratch2-OST0001 /mnt/lustre/local/scratch2-OST0001
mount.lustre: mount /dev/disk/by-label/scratch2-OST0001 at /mnt/lustre/local/scratch2-OST0001 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

tunefs.lustre looks okay on both the MDT (which is mounted) and the OSTs:

[root@amds1 ~]# tunefs.lustre /dev/disk/by-label/scratch2-MDT0000
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target:     scratch2-MDT0000
Index:      0
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=panic,iopen_nopriv,user_xattr,maxdirsize=2000
Parameters: lov.stripecount=4 failover.node=failnode@tcp1 failover.node=failnode@o2ib1 mdt.group_upcall=/usr/sbin/l_getgroups

Permanent disk data:
Target:     scratch2-MDT0000
Index:      0
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:      0x5 (MDT MGS )
Persistent mount opts: errors=panic,iopen_nopriv,user_xattr,maxdirsize=2000
Parameters: lov.stripecount=4 failover.node=failnode@tcp1 failover.node=failnode@o2ib1 mdt.group_upcall=/usr/sbin/l_getgroups

exiting before disk write.

[root@aoss1 ~]# tunefs.lustre /dev/disk/by-label/scratch2-OST0001
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target:     scratch2-OST0001
Index:      1
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:      0x2 (OST )
Persistent mount opts: errors=panic,extents,mballoc
Parameters: mgsnode=mds-server1@tcp1 mgsnode=mds-server1@o2ib1 mgsnode=mds-server2@tcp1 mgsnode=mds-server2@o2ib1 failover.node=failnode@tcp1 failover.node=failnode@o2ib1

Permanent disk data:
Target:     scratch2-OST0001
Index:      1
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:      0x2 (OST )
Persistent mount opts: errors=panic,extents,mballoc
Parameters: mgsnode=mds-server1@tcp1 mgsnode=mds-server1@o2ib1 mgsnode=mds-server2@tcp1 mgsnode=mds-server2@o2ib1 failover.node=falnode@tcp1 failover.node=failnode@o2ib1

exiting before disk write.

I am really stuck and could really use some help. Thanks.

==
Joe Mervini
Sandia National Laboratories
Dept 09326
PO Box 5800 MS-0823
Albuquerque NM 87185-0823
Re: [Lustre-discuss] Need help
Did you also install the correct e2fsprogs? cliffw

On Fri, Jul 1, 2011 at 5:45 PM, Mervini, Joseph A jame...@sandia.gov wrote: [earlier message quoted]

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
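Acting on cliffw's question might look like the sketch below. Lustre's ldiskfs backend generally wants the Lustre-patched e2fsprogs build, so checking the installed package and re-reading the OST label are quick first steps (the rpm query is a RHEL-ism; commands are echoed rather than executed):

```shell
#!/bin/sh
# Sketch: verify the e2fsprogs build and re-read the OST label after the upgrade.
# The device path is from the thread; the checks themselves are illustrative.
CMDS=""
run() { echo "+ $*"; CMDS="$CMDS; $*"; }   # echo commands instead of running them

run rpm -q e2fsprogs                              # should report the Lustre-patched build
run e2label /dev/disk/by-label/scratch2-OST0001   # label should still read scratch2-OST0001
```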
Re: [Lustre-discuss] need help
Ashok nulguda wrote: Dear All, How to forcefully shutdown the Lustre service from client and OST and MDS server when IO are opening.

For the servers, you can just umount them. There will not be any filesystem corruption, but files will not have the latest data -- the cache on the clients will not be written to disk (unless recovery happens, i.e. you restart the servers without having rebooted the clients). In an emergency, this is normally all you have time to do before shutting down the system.

To unmount clients, not only can there not be any I/O, you also need to first kill every process that has an open file on Lustre. lsof can be useful here if you don't want to do a full shutdown, but in many environments killing non-system processes is enough. Normally you'd want to shut down all the clients first, and then the servers.

Kevin
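Kevin's shutdown order can be sketched as below. The mount points are hypothetical placeholders, and fuser -km is one way to implement the "kill every process with an open file" step he describes (lsof /mnt/lustre lists them first if you prefer); the commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch: emergency shutdown order per the advice above --
# kill Lustre users on each client, umount clients, then OSTs, then the MDT.
CMDS=""
run() { echo "+ $*"; CMDS="$CMDS; $*"; }   # echo commands instead of running them

# On each client:
run fuser -km /mnt/lustre      # kill every process with open files under the mount
run umount /mnt/lustre
# Then on each OSS, and finally on the MDS:
run umount /mnt/ost1
run umount /mnt/mdt
```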
[Lustre-discuss] need help
Dear All, How do I forcefully shut down the Lustre service on the clients and on the OST and MDS servers while I/O is still open?

Thanks and Regards
Ashok

--
Ashok Y. Nulguda
System Administrator
Tata Elxsi, Pune
mobile: +91-9689945767
[Lustre-discuss] Need Help
Dear All, How do I forcefully shut down Lustre?

Thanks and Regards
Ashok

--
Ashok Y. Nulguda
System Administrator
Tata Elxsi, Pune
mobile: +91-9689945767
Re: [Lustre-discuss] need help debugging an access permission problem
Cheers Andreas. I had actually found that, but there doesn't seem to be that much documentation about it. Or I didn't find it :) Plus it appeared to find the users that were problematic whenever I tried it, so I wondered if that is all there is, or if there's some other mechanism I could test. Tina

On 23/09/10 22:25, Andreas Dilger wrote:
On 2010-09-23, at 08:03, Tina Friedrich wrote: Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all LDAP servers).

You can run l_getgroups -d {uid} (the utility that the MDS uses to query the groups database/LDAP) from the command line. Cheers, Andreas

--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
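Andreas's l_getgroups check might look like this in practice; run it on the MDS so it exercises the same name-service path the MDS itself uses. The uid and username are made-up examples, and the commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch: query the groups database the way the MDS does, for one test uid,
# and cross-check against the ordinary name-service view on the same host.
CMDS=""
run() { echo "+ $*"; CMDS="$CMDS; $*"; }   # echo commands instead of running them

run l_getgroups -d 10042   # hypothetical uid; prints the groups the MDS would resolve
run id testuser            # hypothetical user; compare against the client-side view
```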
Re: [Lustre-discuss] need help debugging an access permission problem
Hi! On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote: [earlier message quoted]

Mind that access to cached files is no longer authorized by the MDS, but by the client itself. I wouldn't call it documentation, but http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an illustration of why this is a problem when name services get out of sync between the MDS and the Lustre clients (slides 23/24). Sounds like you hit a very similar issue. Regards, Daniel.
Re: [Lustre-discuss] need help debugging an access permission problem
Actually, what I hit was that one of the LDAP servers private to the MDS erroneously had a size limit set where the others are unlimited. They're round-robin'd, which is why I was seeing an intermittent effect. So not a client issue; the clients would not have used this server for their lookups.

Which is why I'm puzzled as to how this works, and trying to understand it a bit better; to my understanding, this should not affect lookups on single users, only 'bulk' transfers of data, at least as I understand it?

Tina

On 24/09/10 12:35, Daniel Kobras wrote:
> Mind that access to cached files is no longer authorized by the MDS, but by the client itself. I wouldn't call it documentation, but http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an illustration of why this is a problem when nameservices become out of sync between MDS and Lustre clients (slides 23/24). Sounds like you hit a very similar issue.
>
> Regards,
> Daniel.

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
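[Editor's note: since the replicas sit behind a round-robin name, an intermittent failure like this is easiest to pin down by querying each server directly and checking the ldapsearch exit status (4 is sizeLimitExceeded). A minimal sketch; the hostnames and search base are placeholders, not values from this thread:]

```shell
#!/bin/sh
# Hedged sketch: probe each LDAP replica directly instead of going through
# the round-robin name, so a per-server misconfiguration shows up reliably.
# Hostnames and BASE are illustrative assumptions -- substitute your own.
BASE="ou=People,dc=example,dc=com"

probe_servers() {
    for host in ldap1.example.com ldap2.example.com ldap3.example.com; do
        # ldapsearch exits with the LDAP result code; 4 is sizeLimitExceeded
        ldapsearch -x -H "ldap://$host" -b "$BASE" \
            '(objectClass=posixAccount)' uid >/dev/null 2>&1
        rc=$?
        case $rc in
            0) echo "$host: ok" ;;
            4) echo "$host: SIZE LIMIT EXCEEDED" ;;
            *) echo "$host: ldapsearch exited $rc" ;;
        esac
    done
}

probe_servers
```

Running this once per replica would have flagged the single misconfigured server immediately, without waiting for the round-robin to land on it.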
Re: [Lustre-discuss] need help debugging an access permission problem
I think there is a bit of confusion here. The MDS does the initial authorization for the file, using l_getgroups to access the group information from LDAP (or whatever database is used). Daniel's point was that after the client has been granted access to the file, it will cache this file locally until the lock is dropped from the client.

Cheers, Andreas

On 2010-09-24, at 7:58, Tina Friedrich tina.friedr...@diamond.ac.uk wrote:
> Actually, what I hit was that one of the LDAP servers private to the MDS erroneously had a size limit set where the others are unlimited. They're round-robin'd, which is why I was seeing an intermittent effect. So not a client issue; the clients would not have used this server for their lookups.
>
> Which is why I'm puzzled as to how this works, and trying to understand it a bit better; to my understanding, this should not affect lookups on single users, only 'bulk' transfers of data, at least as I understand it?
>
> Tina
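[Editor's note: since authorization for cached files is only re-checked once the client's locks are gone, one way to verify this behaviour while debugging is to flush the client's DLM lock cache and retry the access. A hedged sketch: `lru_size=clear` is a documented lctl knob, but namespace names vary between Lustre versions, so treat the glob as an assumption and check `list_param` output first:]

```shell
#!/bin/sh
# Hedged sketch: drop the client's cached locks so the next access goes
# back to the MDS (and its l_getgroups upcall) instead of being satisfied
# from the local cache. Run on a Lustre *client*.
flush_lock_cache() {
    if lctl list_param ldlm.namespaces.* >/dev/null 2>&1; then
        # 'clear' instantly drops all unused locks in each namespace
        lctl set_param 'ldlm.namespaces.*.lru_size=clear' >/dev/null 2>&1 \
            && echo "lock cache cleared" \
            || echo "failed to clear lock cache"
    else
        echo "no ldlm namespaces found (is a Lustre client mounted?)"
    fi
}

flush_lock_cache
```

If an `ls` that succeeded from cache starts failing after the flush, the problem is on the MDS-side lookup path rather than on the client.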
Re: [Lustre-discuss] need help debugging an access permission problem
In fact, the issue occurs when the MDS does the upcall (by default handled by the user-space l_getgroups utility) for the user/group information related to an RPC: there is one UID per upcall, and all the supplementary groups of that UID (not more than sysconf(_SC_NGROUPS_MAX) of them) are returned. So the process always operates on a single user at a time. If an improper (LDAP) configuration for some user(s) caused the failure, you have to verify the users one by one.

Cheers, Nasf

On 9/24/10 9:58 PM, Tina Friedrich wrote:
> Actually, what I hit was that one of the LDAP servers private to the MDS erroneously had a size limit set where the others are unlimited. They're round-robin'd, which is why I was seeing an intermittent effect. So not a client issue; the clients would not have used this server for their lookups.
>
> Which is why I'm puzzled as to how this works, and trying to understand it a bit better; to my understanding, this should not affect lookups on single users, only 'bulk' transfers of data, at least as I understand it?
>
> Tina
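[Editor's note: "verify the users one by one" is easy to script around the same l_getgroups utility the MDS upcall uses. A minimal sketch; the UID list and the pass/fail criterion (nonzero exit or empty output) are illustrative assumptions:]

```shell
#!/bin/sh
# Hedged sketch: run the MDS's group-upcall utility for each UID in turn
# and flag the ones whose lookup fails. The UIDs are placeholders.
check_uids() {
    for uid in 1001 1002 1003; do   # substitute your real UID range
        if out=$(l_getgroups -d "$uid" 2>/dev/null) && [ -n "$out" ]; then
            echo "uid $uid: ok"
        else
            echo "uid $uid: lookup failed"
        fi
    done
}

check_uids
```

Run on the MDS, this exercises exactly the lookup path the upcall takes, so a per-user LDAP problem shows up without involving any client.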
[Lustre-discuss] need help debugging an access permission problem
Hello List,

I'm after debugging hints... I have a couple of users that intermittently get I/O errors when trying to ls a directory (as in, within half an hour: works - doesn't work - works...). Users/groups are kept in ldap; as far as I can see/check, the ldap information is consistent everywhere (i.e. no replication failure or anything).

I am trying to figure out what is going on here/where this is going wrong. Can someone give me a hint on how to debug this? Specifically, how does the MDS look up this sort of information? Could there be a 'list too long' type of error involved, something like that?

Thanks,
Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Re: [Lustre-discuss] need help debugging an access permission problem
Hi, thanks for the answer. I found it in the meantime; one of our ldap servers had a wrong size limit entry. I had of course already looked at the logs - they didn't yield much in terms of why, only what (as in, I could see there were permission errors, but that doesn't really tell you why you are getting them; there weren't any log entries hinting at 'size limit exceeded' or anything).

Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all ldap servers).

Tina

On 23/09/10 11:20, Ashley Pittman wrote:
> Could you give an indication as to the number of files in the directory concerned? What is the full ls command issued (allowing for shell aliases), and in the case where it works, is there a large variation in the time it takes? In terms of debugging it I'd say the log files for the client in question and the MDS would be the most likely place to start.
>
> Ashley,

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Re: [Lustre-discuss] need help debugging an access permission problem
On 9/23/10 10:03 PM, Tina Friedrich wrote:
> Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all ldap servers).

The topic on the User/Group Cache Upcall in the manual may be helpful for you: for lustre-1.8.x it is chapter 28.1; for lustre-2.0.x, chapter 29.1. Good luck!

Cheers, Nasf
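[Editor's note: the manual chapters above describe the upcall configuration; a quick way to see which upcall program your MDS actually uses is to read the parameter with lctl. Hedged sketch: the parameter name changed between releases (group_upcall under mds.* in 1.8.x, identity_upcall under mdt.* in 2.x), so the exact names here are assumptions to verify against your version:]

```shell
#!/bin/sh
# Hedged sketch: report the upcall program the MDS is configured to use
# for user/group lookups. Parameter names differ by Lustre release,
# so try both forms. Run on the MDS.
show_upcall() {
    lctl get_param mds.*.group_upcall 2>/dev/null \
        || lctl get_param mdt.*.identity_upcall 2>/dev/null \
        || echo "no MDS/MDT upcall parameter found (run this on the MDS)"
}

show_upcall
```

Whatever binary this points at (l_getgroups by default in 1.8.x) is the one to test by hand, as suggested elsewhere in the thread.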
Re: [Lustre-discuss] need help debugging an access permission problem
On 2010-09-23, at 08:03, Tina Friedrich wrote:
> Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all ldap servers).

You can run l_getgroups -d {uid} (the utility that the MDS uses to query the groups database/LDAP) from the command line.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.