NetWorker – Random stalling
One of the things I’ve spent a lot of time with has been EMC NetWorker (previously Legato NetWorker).
A vaguely common issue is for a process of some kind - backups, staging to tape, restores, etc. - to just stop making any new progress, for no apparent reason.
Once you’ve checked off the common reasons - like making sure you haven’t run out of disk space or usable tapes - it seems like the only option is to restart NetWorker as a whole, losing any in-progress actions (even ones on devices that haven’t stalled).
I suspect that random underlying I/O issues can occasionally upset it, and it doesn’t quite recover. But, whatever. How do you make it recover a single device, without restarting the whole thing?
First up, get the PID of the main nsrd process. On Solaris, use ps -ef | grep nsrd; on Linux, ps uaxw | grep nsrd.
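If you'd rather not eyeball the ps output, here is a minimal sketch of a filter that pulls the PID out for you. The function name nsrd_pid_from_ps is made up for illustration; it assumes the PID is in the second column (true for both ps -ef and ps uaxw) and that the daemon appears in the listing as nsrd or as a path ending in /nsrd.

```sh
# Hypothetical helper: reads ps output on stdin and prints the PID of nsrd.
# Assumes column 2 is the PID, as in both `ps -ef` and `ps uaxw`.
nsrd_pid_from_ps() {
    awk '
        {
            # Look for a field that is exactly "nsrd" or a path ending in
            # "/nsrd", so that nsrmmd and friends are not matched.
            for (i = 3; i <= NF; i++) {
                if ($i ~ /(^|\/)nsrd$/) { print $2; next }
            }
        }'
}

# Usage (pipe ps in directly, so there is no grep process to confuse things):
#   ps -ef | nsrd_pid_from_ps
```

On systems that have pgrep, pgrep -x nsrd should give the same answer with less typing.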
Assuming the PID is 1234, you next need to run: dbgcommand -p 1234 PrintDevInfo
It should pretty quickly spit out a whole stack of debugging info to /nsr/logs/daemon.raw. It’s moderately complicated, but you should see that it’s a dump of the internal state of each device, including d_device (the *nix device or directory) and mm_number (the unique ID of the nsrmmd process for that device).
So - find the device you’re interested in, and find the mm_number for that device.
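To cut the dump down to the two fields that matter, something like the following may help. It is only a sketch: the function name show_device_fields is invented here, and it assumes nothing about the layout of the dump beyond the field names d_device and mm_number appearing as plain text on their lines.

```sh
# Hypothetical helper: print only the lines mentioning d_device or mm_number,
# with line numbers so each device can be paired with its mm_number.
# Defaults to /nsr/logs/daemon.raw; pass another path as $1 to override.
show_device_fields() {
    grep -n -E 'd_device|mm_number' "${1:-/nsr/logs/daemon.raw}"
}

# Usage:
#   show_device_fields
#   show_device_fields /path/to/a/copy/of/daemon.raw
```

The line numbers from grep -n make it easier to see which mm_number belongs to which d_device when the dump covers a lot of devices.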
Get a list of your nsrmmd processes, eg. ps -ef | grep nsrmmd or ps auxw | grep nsrmmd. If your mm_number is 5, then there will be a process nsrmmd -n 5.
Kill the process, and it should re-spawn by itself on further access.
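The last two steps can be sketched as a small filter. The function name nsrmmd_pid_from_ps is hypothetical; it assumes the PID is in the second column of the ps output and that the process shows up as nsrmmd (or a path ending in /nsrmmd) with -n followed by the mm_number somewhere in its arguments.

```sh
# Hypothetical helper: reads ps output on stdin and prints the PID of the
# nsrmmd process started with "-n <mm_number>". $1 is the mm_number.
nsrmmd_pid_from_ps() {
    awk -v num="$1" '
        {
            seen = 0
            for (i = 3; i <= NF; i++) {
                # First find the nsrmmd command itself on this line...
                if ($i ~ /(^|\/)nsrmmd$/) {
                    seen = 1
                # ...then look for "-n <num>" among its arguments.
                } else if (seen && $i == "-n" && $(i + 1) == num) {
                    print $2
                    next
                }
            }
        }'
}

# Usage, for mm_number 5 (check the PID looks right before killing it):
#   pid=$(ps -ef | nsrmmd_pid_from_ps 5)
#   echo "$pid"
#   kill "$pid"
```

Printing the PID before killing it is deliberate: it is worth a quick look to confirm that exactly one process came back.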