[Lustre-discuss] Need help on lustre filesystem setup..

2013-03-18 Thread linux freaker
Hi,

I am trying to run Apache Hadoop project on parallel filesystem like
lustre. I have 1 MDS, 2 OSS/OST and 1 Lustre Client.

My lustre client shows:
Code:
[root@lustreclient1 ~]# lfs df -h
UUID                   bytes     Used  Available  Use%  Mounted on
lustre-MDT0000_UUID     4.5G   274.3M       3.9G    6%  /mnt/lustre[MDT:0]
lustre-OST0000_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:0]
lustre-OST0001_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:1]
lustre-OST0002_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:2]
lustre-OST0003_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:3]
lustre-OST0004_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:4]
lustre-OST0005_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:5]
lustre-OST0006_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:6]
lustre-OST0007_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:7]
lustre-OST0008_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:8]
lustre-OST0009_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:9]
lustre-OST000a_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:10]
lustre-OST000b_UUID     5.9G   276.1M       5.3G    5%  /mnt/lustre[OST:11]

filesystem summary:    70.9G     3.2G      64.0G    5%  /mnt/lustre
As I was unsure which machine I needed to install the Hadoop software on,
I decided to go ahead and install Hadoop on LustreClient1.

I configured LustreClient1 with the JAVA_HOME and HADOOP parameters, and
added the following file entries:
File: conf/core-site.xml
Code:
<property>
  <name>fs.default.name</name>
  <value>file:///mnt/lustre</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>${fs.default.name}/hadoop_tmp/mapred/system</value>
  <description>The shared directory where MapReduce stores control
  files.</description>
</property>
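As a sketch, the two properties can be generated into a conf/core-site.xml like so (the demo output directory is an assumption for illustration; only fs.default.name and mapred.system.dir are taken from the thread):

```shell
# Generate a minimal core-site.xml pointing Hadoop's default filesystem
# at the Lustre client mount. CONF_DIR is a demo path, not a real Hadoop
# install location.
CONF_DIR=${CONF_DIR:-/tmp/hadoop-conf-demo}
LUSTRE_MOUNT=${LUSTRE_MOUNT:-/mnt/lustre}
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file://$LUSTRE_MOUNT</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>\${fs.default.name}/hadoop_tmp/mapred/system</value>
  </property>
</configuration>
EOF
grep 'fs.default.name' "$CONF_DIR/core-site.xml"
```

With fs.default.name set to a file:// URI on the shared mount, every node that mounts Lustre at the same path sees the same namespace, which is the whole point of running MapReduce over a parallel filesystem instead of HDFS.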
I didn't make any changes in mapred-site.xml.

Now when I run 'bin/start-mapred.sh', it tries to ssh to my own local
machine. I am not sure whether I am doing this right.

Doubt: Do I need two Lustre clients for this to work?

Then I tried running the wordcount example shown below:

Code:
bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /tmp/rahul /tmp/rahul/rahul-output

... Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:29 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:30 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:31 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
13/03/14 18:12:32 INFO ipc.Client: Retrying connect to server: 10.94.214.188/10.94.214.188:54311. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
Question 1: As I have been comparing HDFS and Lustre for Hadoop, what
would be the right number of hardware nodes to compare? Say I have 1
MDS, 2 OSS, and 1 Lustre client on one side, and 1 NameNode and 3
DataNodes on the other -- how can I compare the two filesystems?
Question 2: Do I really need 2 Lustre clients to set up Hadoop over
Lustre? If possible, how can I also use the OSS and MDS nodes for the
Hadoop setup?
Question 3: As I read regarding the wordcount example, we need to
insert the input data into HDFS; do we need to do the same for Lustre?
What's the command?
Question 4: What are the steps to confirm that Hadoop is actually using
the Lustre filesystem?
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Need help on lustre..

2013-02-17 Thread linux freaker
I tried running the command below but got the following error.
I have not put the data into HDFS, since Lustre is what I am trying to use instead.

[code]
#bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /user/hadoop/hadoop /user/hadoop-output

13/02/17 17:02:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/17 17:02:50 INFO input.FileInputFormat: Total input paths to process : 1
13/02/17 17:02:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/17 17:02:50 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004
13/02/17 17:02:50 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3731)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3695)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
Caused by: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3729)
        ... 12 more

org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3731)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3695)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
Caused by: java.io.FileNotFoundException: File file:/tmp/hadoop-hadoop/mapred/staging/root/.staging/job_201302161113_0004/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:406)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3729)
        ... 12 more

        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at 

Re: [Lustre-discuss] Need help on lustre..

2013-02-17 Thread linux freaker
Great! I tried removing the entry from mapred-site.xml, and it now seems to run well.

Here are the logs now:

[code]
[root@alpha hadoop]# bin/hadoop jar hadoop-examples-1.1.1.jar wordcount /user/hadoop/hadoop/ /user/hadoop/hadoop/output
13/02/17 17:14:37 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/17 17:14:38 INFO input.FileInputFormat: Total input paths to process : 1
13/02/17 17:14:38 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/17 17:14:38 INFO mapred.JobClient: Running job: job_local_0001
13/02/17 17:14:38 INFO util.ProcessTree: setsid exited with exit code 0
13/02/17 17:14:38 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2f74219d
13/02/17 17:14:38 INFO mapred.MapTask: io.sort.mb = 100
13/02/17 17:14:38 INFO mapred.MapTask: data buffer = 79691776/99614720
13/02/17 17:14:38 INFO mapred.MapTask: record buffer = 262144/327680
13/02/17 17:14:38 INFO mapred.MapTask: Starting flush of map output
13/02/17 17:14:39 INFO mapred.JobClient:  map 0% reduce 0%
13/02/17 17:14:39 INFO mapred.MapTask: Finished spill 0
13/02/17 17:14:39 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/02/17 17:14:39 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6d79953c
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Merger: Merging 1 sorted segments
13/02/17 17:14:39 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 79496 bytes
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/02/17 17:14:39 INFO mapred.LocalJobRunner:
13/02/17 17:14:39 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/02/17 17:14:39 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /user/hadoop/hadoop/output
13/02/17 17:14:39 INFO mapred.LocalJobRunner: reduce  reduce
13/02/17 17:14:39 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/02/17 17:14:40 INFO mapred.JobClient:  map 100% reduce 100%
13/02/17 17:14:40 INFO mapred.JobClient: Job complete: job_local_0001
13/02/17 17:14:40 INFO mapred.JobClient: Counters: 20
13/02/17 17:14:40 INFO mapred.JobClient:   File Output Format Counters
13/02/17 17:14:40 INFO mapred.JobClient:     Bytes Written=57885
13/02/17 17:14:40 INFO mapred.JobClient:   FileSystemCounters
13/02/17 17:14:40 INFO mapred.JobClient:     FILE_BYTES_READ=643420
13/02/17 17:14:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=574349
13/02/17 17:14:40 INFO mapred.JobClient:   File Input Format Counters
13/02/17 17:14:40 INFO mapred.JobClient:     Bytes Read=139351
13/02/17 17:14:40 INFO mapred.JobClient:   Map-Reduce Framework
13/02/17 17:14:40 INFO mapred.JobClient:     Map output materialized bytes=79500
13/02/17 17:14:40 INFO mapred.JobClient:     Map input records=2932
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/02/17 17:14:40 INFO mapred.JobClient:     Spilled Records=11180
13/02/17 17:14:40 INFO mapred.JobClient:     Map output bytes=212823
13/02/17 17:14:40 INFO mapred.JobClient:     Total committed heap usage (bytes)=500432896
13/02/17 17:14:40 INFO mapred.JobClient:     CPU time spent (ms)=0
13/02/17 17:14:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=99
13/02/17 17:14:40 INFO mapred.JobClient:     Combine input records=21582
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce input records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce input groups=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Combine output records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/02/17 17:14:40 INFO mapred.JobClient:     Reduce output records=5590
13/02/17 17:14:40 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/02/17 17:14:40 INFO mapred.JobClient:     Map output records=21582
[root@alpha hadoop]#

[/code]

Does this mean Hadoop over Lustre is working fine?

On 2/17/13, linux freaker linuxfrea...@gmail.com wrote:
 I tried running the below command but got the below error.
 I have not put it into HDFS since Lustre is what I am trying to implement
 with.

 [...]

Re: [Lustre-discuss] Need help on lustre..

2013-02-17 Thread Colin Faber
Hi,

I think you might be better served with your Hadoop setup by posting to 
the Hadoop discussion list. Once you have it set up and working, if you 
run into Lustre-related issues, please feel free to post those here.

Good luck!

-cf

On 02/17/2013 04:47 AM, linux freaker wrote:
 Great !!! I tried removing entry from mapred-site.xml and it seems to run
 well.

 [...]

 Does it mean hadoop over lustre is working fine?

Re: [Lustre-discuss] Need Help

2012-01-11 Thread Colin Faber
Hi,

Additional logging from the MDS and OSSs is required to really tell 
what's going on. That said, you can verify that your OSS nodes can 
successfully contact your MDS and MGS nodes; 'lctl ping' will indicate 
this. If you find they are contacting each other successfully, you can 
then try aborting recovery on both the MDT and the OSTs you are 
attempting to mount (the '-o abort_recov' mount option).
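A dry-run sketch of those two checks (the NID, device, and mount point are placeholders, and the commands are only echoed here, since lctl and mount must run on the actual servers):

```shell
# Sketch of the recovery checks described above. MGS_NID and OST_DEV are
# placeholders -- substitute your own. Commands are echoed, not executed.
MGS_NID=${MGS_NID:-mds1@tcp0}
OST_DEV=${OST_DEV:-/dev/mapper/ost1}
{
  # 1) From each OSS, verify LNET connectivity to the MGS/MDS:
  echo "lctl ping $MGS_NID"
  # 2) If connectivity is fine but the mount still hangs in recovery,
  #    abort recovery while mounting the target:
  echo "mount -t lustre -o abort_recov $OST_DEV /mnt/ost1"
} | tee /tmp/lustre_recovery_demo.txt
```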

-cf


On 01/09/2012 04:00 AM, Patrice Hamelin wrote:
 Hi,

 I am getting that occasionally and try remounting again, which works.
 I am interested in finding out what's happening too.

 Thanks.

 On 01/07/12 07:19, Ashok nulguda wrote:
 [...]


Re: [Lustre-discuss] Need Help

2012-01-09 Thread Patrice Hamelin

Hi,

  I am getting that occasionally and try remounting again, which works.
I am interested in finding out what's happening too.


Thanks.

On 01/07/12 07:19, Ashok nulguda wrote:

[...]


--
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada



[Lustre-discuss] Need Help

2012-01-07 Thread Ashok nulguda
Dear All,

We have Lustre 1.8.4 installed with 2 MDS servers and 2 OSS servers, with 17
OSTs and 1 MDT, and HA configured on both the MDS and OSS pairs.

Problem:
Some of my OSTs are not mounting on my OSS servers.
When I try to mount them manually, the mount fails with "Transport endpoint
is not connected":

Command: mount -t lustre /dev/mapper/..   /OST1
 failed: Transport endpoint is not connected

However, when we log in to the MDS server and check the Lustre OST status with
cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status
it shows "completed". Also,
cat /proc/fs/lustre/devices
shows all my MDT and OSTs are up.

Can anyone help us debug this?


Thanks and Regards
Ashok

-- 
*Ashok Nulguda
*
*TATA ELXSI LTD*
 *Mb : +91 9689945767
Mb : +91 9637095767
Land line : 2702044871
*
*Email :ash...@tataelxsi.co.in tshrik...@tataelxsi.co.in*


Re: [Lustre-discuss] Need Help

2012-01-07 Thread Colin Faber

How are your OSTs connected to your OSSs?

-cf

-Original message-
From: Ashok nulguda ashok0...@gmail.com
To: Lustre Discussion list Lustre-discuss@lists.lustre.org
Sent: Sat, Jan 7, 2012 00:19:59 MST
Subject: [Lustre-discuss] Need Help




[Lustre-discuss] Need help

2011-07-01 Thread Mervini, Joseph A
Hi,

I just upgraded our servers from RHEL 5.4 to RHEL 5.5 and went from Lustre 
1.8.3 to 1.8.5.

Now when I try to mount the OSTs I'm getting:

[root@aoss1 ~]# mount -t lustre /dev/disk/by-label/scratch2-OST0001 
/mnt/lustre/local/scratch2-OST0001
mount.lustre: mount /dev/disk/by-label/scratch2-OST0001 at 
/mnt/lustre/local/scratch2-OST0001 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

tunefs.lustre looks okay on both the MDT (which is mounted) and the OSTs:

[root@amds1 ~]# tunefs.lustre /dev/disk/by-label/scratch2-MDT0000
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target: scratch2-MDT0000
Index:  0
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:  0x5
  (MDT MGS )
Persistent mount opts: errors=panic,iopen_nopriv,user_xattr,maxdirsize=2000
Parameters: lov.stripecount=4 failover.node=failnode@tcp1 
failover.node=failnode@o2ib1 mdt.group_upcall=/usr/sbin/l_getgroups


   Permanent disk data:
Target: scratch2-MDT0000
Index:  0
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:  0x5
  (MDT MGS )
Persistent mount opts: errors=panic,iopen_nopriv,user_xattr,maxdirsize=2000
Parameters: lov.stripecount=4 failover.node=failnode@tcp1 
failover.node=failnode@o2ib1 mdt.group_upcall=/usr/sbin/l_getgroups

exiting before disk write.


[root@aoss1 ~]# tunefs.lustre /dev/disk/by-label/scratch2-OST0001
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target: scratch2-OST0001
Index:  1
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:  0x2
  (OST )
Persistent mount opts: errors=panic,extents,mballoc
Parameters: mgsnode=mds-server1@tcp1 mgsnode=mds-server1@o2ib1 
mgsnode=mds-server2@tcp1 mgsnode=mds-server2@o2ib1 
failover.node=failnode@tcp1 failover.node=failnode@o2ib1


   Permanent disk data:
Target: scratch2-OST0001
Index:  1
Lustre FS:  scratch2
Mount type: ldiskfs
Flags:  0x2
  (OST )
Persistent mount opts: errors=panic,extents,mballoc
Parameters: mgsnode=mds-server1@tcp1 mgsnode=mds-server1@o2ib1 
mgsnode=mds-server2@tcp1 mgsnode=mds-server2@o2ib1 
failover.node=falnode@tcp1 failover.node=failnode@o2ib1

exiting before disk write.


I am really stuck and could really use some help.

Thanks.

==
 
Joe Mervini
Sandia National Laboratories
Dept 09326
PO Box 5800 MS-0823
Albuquerque NM 87185-0823
 




Re: [Lustre-discuss] Need help

2011-07-01 Thread Cliff White
Did you also install the correct e2fsprogs?
cliffw


On Fri, Jul 1, 2011 at 5:45 PM, Mervini, Joseph A jame...@sandia.gov wrote:

 Hi,

 I just upgraded our servers from RHEL 5.4 to RHEL 5.5 and went from lustre
 1.8.3 to 1.8.5.

 [...]




-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com


Re: [Lustre-discuss] need help

2011-03-15 Thread Kevin Van Maren
Ashok nulguda wrote:
 Dear All,

 How do I forcefully shut down the Lustre services on the clients, OSTs, and 
 MDS server while I/O is still in progress?

For the servers, you can just umount them. There will not be any file 
system corruption, but files will not have the latest data -- the cache 
on the clients will not be written to disk (unless recovery happens, 
i.e., the servers restart without the clients having rebooted). In an 
emergency, this is normally all you have time to do before shutting down 
the system.

To unmount clients, not only can there not be any I/O, you also need to 
first kill every process that has an open file on Lustre. lsof can be 
useful here if you don't want to do a full shutdown, but in many 
environments killing non-system processes is enough.

Normally you'd want to shut down all the clients first, and then the servers.
Kevin



[Lustre-discuss] need help

2011-03-12 Thread Ashok nulguda
Dear All,

How do I forcefully shut down the Lustre services on the clients, OSTs, and
MDS server while I/O is still in progress?


Thanks and Regards

Ashok



-- 
Ashok Y. Nulguda
System Administrator
Tata Elxsi,
Pune
mobile:-+91-9689945767


[Lustre-discuss] Need Help

2011-03-11 Thread Ashok nulguda
Dear All,

How do I forcefully shut down Lustre?


Thanks and Regards

Ashok

-- 
Ashok Y. Nulguda
System Administrator
Tata Elxsi,
Pune
mobile:-+91-9689945767


Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Tina Friedrich
Cheers, Andreas. I had actually found that, but there doesn't seem to be 
that much documentation about it -- or I didn't find it :) Plus it 
appeared to resolve the problematic users whenever I tried it, so I 
wondered whether that is all there is, or whether there's some other 
mechanism I could test.

Tina

On 23/09/10 22:25, Andreas Dilger wrote:
 On 2010-09-23, at 08:03, Tina Friedrich wrote:
 Still - could someone point me to the bit in the documentation that best
 describes how the MDS queries that sort of information (group/passwd
 info, I mean)? Or how to best test that its mechanisms are working? For
 example, in this case, I always thought one would only hit the size
 limit if doing a bulk 'transfer' of data, not doing a lookup on one user
 - plus I could do these sorts of lookups fine on all machines involved
 (against all ldap servers).

 You can run l_getgroups -d {uid} (the utility that the MDS uses to query 
 the groups database/LDAP) from the command-line.
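As a sketch of that lookup (the uid is a placeholder, and since l_getgroups exists only on a Lustre MDS, the command is echoed here rather than executed):

```shell
# Run the MDS group upcall by hand for one uid and compare its output to
# the client-side view of the same account. CHECK_UID=500 is a placeholder.
CHECK_UID=${CHECK_UID:-500}
echo "l_getgroups -d $CHECK_UID" | tee /tmp/l_getgroups_demo.txt
```

If the group list printed on the MDS differs from what `id <user>` shows on a client, the name services on the two sides are out of sync, which matches the permission symptoms described in this thread.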

 Cheers, Andreas
 --
 Andreas Dilger
 Lustre Technical Lead
 Oracle Corporation Canada Inc.




-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

 





Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-24 Thread Daniel Kobras
Hi!

On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
 Cheers Andreas. I had actually found that, but there doesn't seem to be 
 that much documentation about it. Or I didn't find it :) Plus it 
 appeared to find the users that were problematic whenever I tried it, so 
 I wondered if that is all there is, or if there's some other mechanism I 
 could test for.

Mind that access to cached files is no longer authorized by the MDS, but by the
client itself. I wouldn't call it documentation, but
http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
illustration of why this is a problem when nameservices become out of sync
between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
similar issue.

Regards,

Daniel.


Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-24 Thread Tina Friedrich
Actually, what I hit was that one of the LDAP servers private to the MDS 
erroneously had a size limit set where the others are unlimited. They're 
round-robined, which is why I was seeing an intermittent effect. So it's not a 
client issue; the clients would not have used this server for their 
lookups.

Which is why I'm puzzled as to how this works, and am trying to understand 
it a bit better; to my understanding, this should not affect lookups on 
single users, only 'bulk' transfers of data?

Tina

On 24/09/10 12:35, Daniel Kobras wrote:
 Hi!

 On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
 Cheers Andreas. I had actually found that, but there doesn't seem to be
 that much documentation about it. Or I didn't find it :) Plus it
 appeared to find the users that were problematic whenever I tried it, so
 I wondered if that is all there is, or if there's some other mechanism I
 could test for.

 Mind that access to cached files is no longer authorized by the MDS, but by 
 the
 client itself. I wouldn't call it documentation, but
 http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
 illustration of why this is a problem when nameservices become out of sync
 between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
 similar issue.

 Regards,

 Daniel.



-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442






Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-24 Thread Andreas Dilger
I think there is a bit of confusion here. The MDS is doing the initial 
authorization for the file, using l_getgroups to access the group information 
from LDAP (or whatever database is used).

Daniel's point was that after the client has gotten access to the file, it will 
cache this file locally until the lock is dropped from the client. 

Cheers, Andreas

On 2010-09-24, at 7:58, Tina Friedrich tina.friedr...@diamond.ac.uk wrote:

 Actually, what I hit was that one of the LDAP servers private to the MDS 
 erroneously had a size limit set where the others are unlimited. They're 
 round-robined, which is why I was seeing an intermittent effect. So it's not a 
 client issue; the clients would not have used this server for their 
 lookups.
 
 Which is why I'm puzzled as to how this works, and trying to understand 
 it a bit better; to my understanding, this should not affect lookups on 
 single users, only 'bulk' transfers of data, at least as I understand this?
 
 Tina
 
 On 24/09/10 12:35, Daniel Kobras wrote:
 Hi!
 
 On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
 Cheers Andreas. I had actually found that, but there doesn't seem to be
 that much documentation about it. Or I didn't find it :) Plus it
 appeared to find the users that were problematic whenever I tried it, so
 I wondered if that is all there is, or if there's some other mechanism I
 could test for.
 
 Mind that access to cached files is no longer authorized by the MDS, but by 
 the
 client itself. I wouldn't call it documentation, but
 http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
 illustration of why this is a problem when nameservices become out of sync
 between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
 similar issue.
 
 Regards,
 
 Daniel.
 
 
 
 -- 
 Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
 Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
 
 
 
 
 


Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-24 Thread Fan Yong
In fact, the issue occurs when the MDS does an upcall (by default handled 
by the user-space l_getgroups utility) for the user/group information 
related to the RPC: there is one upcall per UID, and all of that UID's 
supplementary groups (at most sysconf(_SC_NGROUPS_MAX) of them) are 
returned. The process is the same whether or not only a single user is 
involved. If an improper configuration (of LDAP) for some user(s) caused 
the failure, you have to verify the users one by one.


Cheers,
Nasf

On 9/24/10 9:58 PM, Tina Friedrich wrote:
 Actually, what I hit was that one of the LDAP servers private to the MDS
 erroneously had a size limit set where the others are unlimited. They're
 round-robined, which is why I was seeing an intermittent effect. So it's not a
 client issue; the clients would not have used this server for their
 lookups.

 Which is why I'm puzzled as to how this works, and trying to understand
 it a bit better; to my understanding, this should not affect lookups on
 single users, only 'bulk' transfers of data, at least as I understand this?

 Tina

 On 24/09/10 12:35, Daniel Kobras wrote:
 Hi!

 On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
 Cheers Andreas. I had actually found that, but there doesn't seem to be
 that much documentation about it. Or I didn't find it :) Plus it
 appeared to find the users that were problematic whenever I tried it, so
 I wondered if that is all there is, or if there's some other mechanism I
 could test for.
 Mind that access to cached files is no longer authorized by the MDS, but by 
 the
 client itself. I wouldn't call it documentation, but
 http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
 illustration of why this is a problem when nameservices become out of sync
 between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
 similar issue.

 Regards,

 Daniel.





[Lustre-discuss] need help debugging an access permission problem

2010-09-23 Thread Tina Friedrich
Hello List,

I'm after debugging hints...

I have a couple of users that intermittently get I/O errors when trying 
to ls a directory (as in, within half an hour, works - doesn't work - 
works...).

Users/groups are kept in ldap; as far as I can see/check, the ldap 
information is consistent everywhere (i.e. no replication failure or 
anything).

I am trying to figure out what is going on here/where this is going 
wrong. Can someone give me a hint on how to debug this? Specifically, 
how does the MDS look up this sort of information, could there be a 
'list too long' type of error involved, something like that?

Thanks,
Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442






Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-23 Thread Tina Friedrich
Hi,

thanks for the answer. I found it in the meantime; one of our ldap 
servers had a wrong size limit entry.

I had of course already looked at the logs - they didn't yield much in 
terms of why, only what (as in, I could see they were permission errors, 
but they don't really tell you why you are getting them; there weren't 
any log entries that hinted at 'size limit exceeded' or anything).

Still - could someone point me to the bit in the documentation that best 
describes how the MDS queries that sort of information (group/passwd 
info, I mean)? Or how to best test that its mechanisms are working? For 
example, in this case, I always thought one would only hit the size 
limit if doing a bulk 'transfer' of data, not doing a lookup on one user 
- plus I could do these sorts of lookups fine on all machines involved 
(against all ldap servers).
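One way to catch the kind of intermittent disagreement behind this is to repeat the same lookup and flag any answer that changes. A hedged sketch, not a Lustre tool: it uses plain NSS lookups, so it would need to run on the MDS, whose resolver actually talks to the private LDAP servers; `root` is a placeholder for the affected user:

```shell
# With round-robined LDAP servers behind the resolver, a misconfigured
# server shows up as an occasional mismatch across repeated lookups.
user=root                              # placeholder: affected account
first=$(getent passwd "$user")
status=consistent
for i in 1 2 3 4 5; do
    cur=$(getent passwd "$user")
    if [ "$cur" != "$first" ]; then
        status=inconsistent
        echo "attempt $i differs: $cur"
    fi
done
echo "$status"
```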

Tina

On 23/09/10 11:20, Ashley Pittman wrote:

 On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

 Hello List,

 I'm after debugging hints...

 I have a couple of users that intermittently get I/O errors when trying
 to ls a directory (as in, within half an hour, works -  doesn't work -
 works...).

 Users/groups are kept in ldap; as far as I can see/check, the ldap
 information is consistent everywhere (i.e. no replication failure or
 anything).

 I am trying to figure out what is going on here/where this is going
 wrong. Can someone give me a hint on how to debug this? Specifically,
 how does the MDS look up this sort of information, could there be a
 'list too long' type of error involved, something like that?

 Could you give an indication as to the number of files in the directory 
 concerned?  What is the full ls command issued (allowing for shell aliases) 
 and in the case where it works is there a large variation in the time it 
 takes when it does work?

 In terms of debugging it I'd say the log files for the client in question and 
 the MDS would be the most likely place to start.

 Ashley,



-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442






Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-23 Thread Fan Yong
  On 9/23/10 10:03 PM, Tina Friedrich wrote:
 Hi,

 thanks for the answer. I found it in the meantime; one of our ldap
 servers had a wrong size limit entry.

 The logs I had of course already looked at - they didn't yield much in
 terms of why, only what (as in, I could see it was permission errors,
 but they do of course not really tell you why you are getting them.
 There weren't any log entries that hinted at 'size limit exceeded' or
 anything.).

 Still - could someone point me to the bit in the documentation that best
 describes how the MDS queries that sort of information (group/passwd
 info, I mean)? Or how to best test that its mechanisms are working? For
 example, in this case, I always thought one would only hit the size
 limit if doing a bulk 'transfer' of data, not doing a lookup on one user
 - plus I could do these sorts of lookups fine on all machines involved
 (against all ldap servers).
The chapter on the User/Group Cache Upcall may be helpful for you: for 
lustre-1.8.x it is chapter 28.1; for lustre-2.0.x it is chapter 29.1.
Good luck!

Cheers,
Nasf
 Tina

 On 23/09/10 11:20, Ashley Pittman wrote:
 On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

 Hello List,

 I'm after debugging hints...

 I have a couple of users that intermittently get I/O errors when trying
 to ls a directory (as in, within half an hour, works -   doesn't work -
 works...).

 Users/groups are kept in ldap; as far as I can see/check, the ldap
 information is consistent everywhere (i.e. no replication failure or
 anything).

 I am trying to figure out what is going on here/where this is going
 wrong. Can someone give me a hint on how to debug this? Specifically,
 how does the MDS look up this sort of information, could there be a
 'list too long' type of error involved, something like that?
 Could you give an indication as to the number of files in the directory 
 concerned?  What is the full ls command issued (allowing for shell aliases) 
 and in the case where it works is there a large variation in the time it 
 takes when it does work?

 In terms of debugging it I'd say the log files for the client in question 
 and the MDS would be the most likely place to start.

 Ashley,





Re: [Lustre-discuss] need help debugging an access permission problem

2010-09-23 Thread Andreas Dilger
On 2010-09-23, at 08:03, Tina Friedrich wrote:
 Still - could someone point me to the bit in the documentation that best 
 describes how the MDS queries that sort of information (group/passwd 
 info, I mean)? Or how to best test that its mechanisms are working? For 
 example, in this case, I always thought one would only hit the size 
 limit if doing a bulk 'transfer' of data, not doing a lookup on one user 
 - plus I could do these sorts of lookups fine on all machines involved 
 (against all ldap servers).

You can run l_getgroups -d {uid} (the utility that the MDS uses to query the 
groups database/LDAP) from the command-line.
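For example, a hedged wrapper around that check: l_getgroups ships with the Lustre server packages, so on machines without it the sketch below approximates the lookup with plain NSS calls (uid 0 is just a placeholder for the affected user's UID):

```shell
# Dump the group membership the MDS would resolve for a given UID.
uid=0                                   # placeholder: affected UID
if command -v l_getgroups >/dev/null 2>&1; then
    l_getgroups -d "$uid"               # the MDS's own upcall helper
else
    # Portable approximation: resolve the UID, then list its groups.
    name=$(getent passwd "$uid" | cut -d: -f1)
    id -G "$name"
fi
```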

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
