Dear Chavdar, Michael and others,

Just an update on the issue I raised a few days ago concerning the failed 
backups.

Our ISP server uses a separate Windows server with about 100 TB of containers, 
spread over 28 disks.
Backups started crashing about a week ago.
At some point the container server would get stuck for five minutes or so.

Of course, several different things happened at the same time:

- The problem started after Windows Server updates, which forced us to reboot 
  most of our systems.
- One user dumped about a terabyte of mostly small files on our system.
- Our Spectrum Protect system manager was on holiday.
- As always, there are the usual suspects: antivirus etc.

Our container server runs on VMware ESXi infrastructure. We opened a support 
call with VMware and sent them the logs of the ESXi server.
They found a very simple cause for the problem: disks had filled up, and the 
system froze.

When checking the logs, I found that the backup containers opened in write mode 
were on disks without any space left, while other disks were less than half 
full.
So here is my solution: set the container directories that are full to 
read-only, move containers, and wait until the containers are deleted.
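
A minimal sketch of how full container directories could be spotted before 
they freeze the system, assuming Python is available on the container server. 
The 10% free-space threshold, the function names and the sample paths are 
hypothetical and not part of ISP. The generated command follows the 
"update stgpooldir ... access=..." form used elsewhere in this thread and 
should be checked against the server documentation before use.

```python
import shutil
from dataclasses import dataclass


@dataclass
class DirUsage:
    path: str          # storage pool directory (hypothetical examples below)
    total_bytes: int   # size of the volume holding the directory
    free_bytes: int    # free space left on that volume


def measure(path: str) -> DirUsage:
    """Read total and free bytes for the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return DirUsage(path, usage.total, usage.free)


def split_by_free_space(dirs, min_free_fraction=0.10):
    """Split directories into (full, writable) by free-space fraction.

    The 10% default is an assumed threshold, not an ISP recommendation.
    """
    full, writable = [], []
    for d in dirs:
        fraction = d.free_bytes / d.total_bytes if d.total_bytes else 0.0
        if fraction < min_free_fraction:
            full.append(d)
        else:
            writable.append(d)
    return full, writable


def readonly_commands(pool, full_dirs):
    """Build one admin command per full directory; nothing is executed here."""
    return [
        f"update stgpooldir {pool} {d.path} access=readonly"
        for d in full_dirs
    ]


if __name__ == "__main__":
    # Made-up numbers for illustration; use measure(path) for real values.
    sample = [
        DirUsage(r"\\example\tsmc_a", 4_000_000_000_000, 0),
        DirUsage(r"\\example\tsmc_b", 4_000_000_000_000, 2_500_000_000_000),
    ]
    full, writable = split_by_free_space(sample)
    for line in readonly_commands("CPOOL", full):
        print(line)
```

The script only prints the commands, so an administrator can review them 
before running anything against the server.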

My question is: why is this process not managed automatically by ISP? Why are 
disks with a lot of free space not prioritized for writing?

Thanks for your help !

David de Leeuw

-----Original Message-----
From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Chavdar Cholev
Sent: Monday, August 21, 2023 6:11 PM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server and client

Hi David,
Just make sure that containers are excluded from anti-virus scan.

On Sunday, August 20, 2023, David L.A. De Leeuw <da...@bgu.ac.il> wrote:

> Hi all,
>
> Apparently, this has nothing to do with SP at all !
>
> The (Windows Server 2019 on ESXi) system holding the containers just 
> disconnects for 5 minutes !
>
> No pings to the server.
>
> When access is restored, later on, a message appears in the events:
> "The system time has changed to 2023-08-20T19:05:05 from
> 2023-08-20T19:01:04  "
> This is not even a warning, just "information".
>
> I have no idea why this should happen, but we will find it.
> Thanks for your support !
>
> David
>
>
>
> -----Original Message-----
> From: דוד דה ליאו
> Sent: Sunday, August 20, 2023 9:37 PM
> To: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU>
> Subject: RE: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server 
> and client
>
> Hi Michael,
>
> Thanks a lot.
>
> The SP Server is not on a VM, just the storage. I am not the manager of 
> the server.
> We just get a lot of backup storage if we provide the space for the 
> containers.
>
> Sure, we run a lot of sessions in parallel, as you said. I will try a 
> run according to your recommendations.
> One other thought I am testing is that over a year ago we also had 
> crashes. The 10 Gb optical network had hiccups. Our 1 Gb line worked fine.
> I just switched back to the 1 Gb line and will see what happens.
>
> Will keep you posted !
>
> David
>
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of 
> Michael Prix
> Sent: Sunday, August 20, 2023 9:04 PM
> To: ADSM-L@VM.MARIST.EDU
> Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server 
> and client
>
> Hello David,
>
>   an *SP-Server in a VM is not the best setup, but nevertheless it 
> should work - and has proven so in the past.
>
> For the client: Please show the dsm.opt. I suspect you are running 
> several sessions from this client in parallel during a backup -> stop 
> that for the moment.
> Start with a basic dsm.opt, disable the option "resourceutilization", 
> if set, and set "memoryefficient yes" (or "diskcachem" if you like). 
> If it still crashes with a plain dsm.opt, you should open a ticket with IBM.
>
> --
> Michael Prix
>
>
>
>
> August 20, 2023 at 7:25 PM, "David L.A. De Leeuw" <da...@bgu.ac.il> wrote:
>
>
> >
> > Hi Chavdar and Michael,
> >
> > Thanks for your thoughts and help.
> >
> > I added "memoryefficientbackup".
> >
> > But still the sessions keep crashing. Once the session crashes, I get a
> > whole lot of errors for storage pool directories, and in fact the whole
> > pool becomes unavailable.
> > I run "update stgpooldir ... access=readwrite" and all is accessible again.
> > Some of the containers are in an unavailable state and need an audit.
> >
> > Our container storage is on a Dell PowerEdge R730xd with 24 CPUs allocated,
> > 64 GB memory and 110 TB disk. The disks are declared as VMDKs.
> > Network is on a 10Gb Intel 82588 card.
> > Nothing I can see points to a lack of resources.
> >
> > Everything worked fine until 4 days ago. That is why I suspected a problem
> > with the Windows updates, but since I rolled them back, that does not make
> > sense.
> >
> > I am quite at a loss where to look next ...
> >
> > Thanks
> >
> > David
> >
> > [Server Side] .
> > 20-08-2023 19:47:22 ANR0839I Session 197902 started for node MEDFS2
> (WinNT)
> >  (SSL medspice.bgu.ac.il[132.72.73.246]:53184) on  
> > STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197902)
> > 20-08-2023 19:47:26 ANR8592I Session 197903 connection is using 
> > protocol  TLSV13, cipher specification TLS_AES_256_GCM_SHA384,  
> > certificate TSM Self-Signed Certificate. (SESSION:
> >  197903)
> > 20-08-2023 19:47:26 ANR0839I Session 197903 started for node MEDFS2
> (WinNT)
> >  (SSL medspice.bgu.ac.il[132.72.73.246]:53185) on  
> > STOREWARE13.auth.ad.bgu.ac.il:1502. (SESSION: 197903)
> > 20-08-2023 19:47:55 ANR2012W Error encountered for storage pool
> directory:
> >  \\medbackup.med.ad.bgu.ac.il\tsmc20 in storage pool:
> >  CPOOL. (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR1181E sdtxn.c(1404): Data storage transaction
> >  0:83236375 was aborted. (SESSION: 197881)
> > 20-08-2023 19:47:55 ANR0204I The container state for
> >  \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.-
> >  ncf is updated from AVAILABLE to UNAVAILABLE. (SESSION:
> >  197883)
> > 20-08-2023 19:47:55 ANR3660E An unexpected error occurred while 
> > opening
> or
> >  writing to the container. Container
> >  \\medbackup.med.ad.bgu.ac.il\tsmc17\18\0000000000001853.-
> >  ncf in stgpool CPOOL has been marked as UNAVAILABLE and  should be 
> > audited to validate accessibility and content.
> >  (SESSION: 197883)
> >
> > [From the client side:]
> >
> > During the incr of a large filespace:
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU -
> E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 
> 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of 
> MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx ** 
> Unsuccessful **
> > ANS1228E Sending of object '\\medfs2\e$\medusers14\angel\17.8.23 BU 
> > -
> E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 
> 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of 
> MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' failed.
> > ANS1311E Server out of data storage space
> >
> > [I ran sel of the latest file. It failed because all containerdirs were
> > unavailable.]
> >
> > ANS1804E Selective Backup processing of 
> > '\\medfs2\e$\medusers14\angel\17.8.23
> BU - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's 
> folder 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY 
> PROCESSING of MRI and general MRI data\For-Crop-T2W - coronal 
> Copy.pptx' finished with failures.
> >
> > Total number of objects inspected: 1 Total number of objects backed 
> > up: 0 Total number of objects updated: 0 Total number of objects 
> > rebound: 0 Total number of objects deleted: 0 Total number of 
> > objects expired: 0 Total number of objects failed: 1  ...
> > Network data transfer rate: 148.306,35 KB/sec Aggregate data 
> > transfer rate: 211,50 KB/sec Objects compressed by: 0% Total data 
> > reduction ratio: 0.23% Subfile objects reduced by: 0% Elapsed 
> > processing time: 00:00:32 ANS1311E Server out of data storage space
> >
> > [Then I updated the containerdirs to readwrite and ran the selective
> > backup. No problem]
> > ------------------------------------------------------------
> > Protect> sel '\\medfs2\e$\medusers14\angel\17.8.23 BU -
> E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 
> 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of 
> MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx'
> > Selective Backup function invoked.
> >
> > Normal File--> 7.132.827 \\medfs2\e$\medusers14\angel\17.8.23 BU -
> E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 
> 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of 
> MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx [Sent]
> > Selective Backup processing of '\\medfs2\e$\medusers14\angel\17.8.23 
> > BU
> - E\MyDocs(E)-PrevOLD-D\MyDocs (D)\PERSON-CRITER\FAMILY\OMRI's folder 
> 313843070\OMRI 1-16 medical issue\MRIs - CTs - OMRI\MY PROCESSING of 
> MRI and general MRI data\For-Crop-T2W - coronal Copy.pptx' finished 
> without failure.
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of
> Chavdar Cholev
> > Sent: Sunday, August 20, 2023 3:43 PM
> > To: ADSM-L@VM.MARIST.EDU
> > Subject: Re: [ADSM-L] INCR backups fail ! TSM 8.1.17 Windows Server 
> > and
> client
> >
> > Just to make sure that we are on the same page...
> > You have TSM installed on a VM running on VMware. This VM has a few LUNs
> > presented, and those LUNs are used for containers?
> >
> > Shot in the dark:
> > 1. Check whether the VM resources match the IBM TSM Blueprint.
> > 2. Check the LUN/HDD response times in Performance Monitor. The response
> > time should be around 20-30 ms during the backup operation.
> > 3. Do you know if the HDDs for those LUNs are .vmdk or RDM (raw device mapping)?
> >
> > Thank you!
> > Chavdar
> >
> > On Saturday, August 19, 2023, David L.A. De Leeuw <da...@bgu.ac.il>
> wrote:
> >
> > >
> > > Hi TSM experts,
> > >
> > >  Our incr backup has been failing consistently for the last few days. It
> > >  starts all right, but after a few gigabytes we get this error on the
> > >  client:
> > >
> > >  ANS1301E This operation cannot continue due to an error on the 
> > > IBM  Spectrum Protect server. See your IBM Spectrum Protect server  
> > > administrator for assistance.
> > >
> > >  On the server side we see:
> > >
> > >  18-08-2023 22:57:25 ANR2012W Error encountered for storage pool
> directory:
> > >  \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
> > >  CPOOL. (SESSION: 194578)
> > >  18-08-2023 22:57:25 ANR0530W Transaction failed for session 
> > > 194578
> for
> > >  node
> > >  MEDFS2 (WinNT) - internal server error detected.
> > >  (SESSION: 194578)
> > >  18-08-2023 22:57:26 ANR2012W Error encountered for storage pool
> directory:
> > >  \\medbackup.med.ad.bgu.ac.il\tsmc1 in storage pool:
> > >  CPOOL. (SESSION: 194578)
> > >
> > >  Then we find one or more containers unavailable. We fix the containers
> > >  with "audit container ... action=scanall".
> > >  No errors are found, but the next backup fails again.
> > >
> > >  The server is on 8.1.17, the client as well.
> > >  The containers are on a number of disks on a shared Windows Server 2019.
> > >  There have been some updates on the Windows server recently
> > >  (KB5029247, KB5029647).
> > >
> > >  The audits are fine, data is accessible, but backups fail.
> > >  Any ideas ?
> > >
> > >  David de Leeuw
> > >  Ben-Gurion University of the Negev  Beer Sheva Israel
> > >
> >
>
