Auto run SFC and DISM maintenance on Windows Update failure

Is it possible to have a recovery action so that, if Windows Update (Beta) fails to install for any reason other than “reboot required and reboot preference was set to suppress”, it automatically runs SFC and DISM scans and then tries again?

Or could deployments have a recovery action box where we can add our own scripts to run when a failed result is returned? If that is the option, then it would be good to have it on all deployments, including global ones.

SFC and DISM are a good post-update practice, but they don’t offer a lot of effectiveness prior to installing an update. You’d be better served clearing “C:\Windows\SoftwareDistribution”, re-registering DLLs, among other things. A task already exists for that, called “Clear/Repair Windows Update Cache Task (Beta)”.

It would be nice if, when Install Windows Updates fails, it produced a child task to clear the update cache and try again. However, there’s only a certain number of passes it should be allowed before someone should step in. Not a great idea to have it repeatedly trying overnight. So I can imagine the scripting gets a little intense trying to find a balance.
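A minimal sketch of that capped retry idea, written in Python purely for illustration (a real task would be PowerShell inside ImmyBot). The `install` and `repair` callables and the `max_passes` default are hypothetical stand-ins, not anything the existing task exposes.

```python
from __future__ import annotations

from typing import Callable


def install_with_repair(
    install: Callable[[], bool],
    repair: Callable[[], None],
    max_passes: int = 2,
) -> tuple[bool, int]:
    """Run `install`; after a failure run `repair` and retry.

    Stops after `max_passes` install attempts so that a broken machine
    is handed to a person instead of looping overnight.
    Returns (succeeded, attempts_used).
    """
    if max_passes < 1:
        raise ValueError("max_passes must be at least 1")
    for attempt in range(1, max_passes + 1):
        if install():
            return True, attempt
        if attempt < max_passes:
            # Hypothetical repair step, e.g. clearing the update cache.
            repair()
    return False, max_passes
```

With `max_passes=2` the cache is cleared at most once; anything still failing after that is reported as a failure for someone to look at.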

That being said, for other uses, I agree it would be nice to have SFC and DISM ready to run on the fly.

I would agree. I think that if exceptions occur in the Windows Update task (based on the error codes, etc), it could do the Windows Update Cache Task and go through SFC/DISM.

Honestly SFC/DISM have probably helped me the most. Both tasks are needed though.

I’m certainly open to taking actions to correct a problem with Windows Updates, but I (personally) won’t do so “blindly.”

If there are specific error conditions that we can detect and we know the specific process needed in order to fix those error conditions, then I’m happy to incorporate that into the script.

But, I believe it is a pretty big mistake to make the assumption that if Windows Updates fail for any reason that we should run SFC and DISM. What if we run those and continue to see failure? Then what? Depending on the system, this has the potential to extend the runtime of this task for a significant amount of time without guarantee of a successful outcome.

The problem with the Update exceptions is that you really won’t know until you diagnose the problem manually. As you said that is a huge detriment to developing this script.

My proposal:
Do pre-checks (is ThreatLocker installed, does the machine have an HDD, are CPU/disk/memory metrics, or any one of them, high). If a pre-check fails, stop the deployment with a notice of the result.
Assuming the pre-checks succeed, continue with the update.

The other option is to include the most popular issues that you and others see, and put that into the pre-check.

Hope that makes sense

Could maybe be added to preflight to verify image health prior to starting updates. Though to minimize runtime, it should be sufficient to run SFC /verifyonly and DISM /online /cleanup-image /scanhealth, no? Then if either of those return as unhealthy, then run the corresponding fix?
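A rough sketch of that "verify first, repair only if unhealthy" flow, in Python for illustration only. Each check and fix is passed in as a hypothetical callable; how to interpret the output or exit codes of the verify commands is left open here because it would need testing on real machines.

```python
from __future__ import annotations

from typing import Callable

# Each entry: (name, is_healthy, fix). In practice the pairs would wrap
#   sfc /verifyonly                          -> sfc /scannow
#   DISM /Online /Cleanup-Image /ScanHealth  -> DISM /Online /Cleanup-Image /RestoreHealth
HealthCheck = tuple[str, Callable[[], bool], Callable[[], None]]


def verify_then_repair(checks: list[HealthCheck]) -> list[str]:
    """Run each cheap verify step; run its fix only when the verify fails.

    Returns the names of the checks that needed a repair.
    """
    repaired: list[str] = []
    for name, is_healthy, fix in checks:
        if not is_healthy():
            fix()
            repaired.append(name)
    return repaired
```

On a healthy machine this costs only the two verify passes, which is the runtime saving the post is after.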

Alternatively, it seems like Get-WindowsUpdateLog can be used to merge the .etl files in “C:\Windows\Logs\WindowsUpdate” into one .log file. This could probably be parsed and scanned for errors.
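As a sketch of what that scan could look like once the merged log exists, here is a small Python example that counts failure codes in the log text. It assumes nothing about the log's line format: it simply treats every eight-digit hex token with the high bit set (the failure bit of an HRESULT) as a candidate error code, so some hits may be unrelated values.

```python
import re
from collections import Counter

# Any token shaped like 0x followed by exactly eight hex digits.
HEX_TOKEN = re.compile(r"\b0x[0-9A-Fa-f]{8}\b")


def count_failure_codes(log_text: str) -> Counter:
    """Count HRESULT-style failure codes found anywhere in the log text."""
    codes: Counter = Counter()
    for match in HEX_TOKEN.finditer(log_text):
        value = int(match.group(0), 16)
        # Failure HRESULTs have the top (severity) bit set; this skips
        # success values such as 0x00000000.
        if value & 0x80000000:
            codes[f"0x{value:08X}"] += 1
    return codes
```

Calling `count_failure_codes(text).most_common(5)` would then give the most frequent codes to look up.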

That would mean (for us) that dism/sfc would run somewhere between 6 and 10 times per maintenance session.

Preflight was designed to make sure a machine was in the proper state to actually succeed when attempting to run scripts.

The right way to do this is to take action based on a detectable error. Everything else is a shot in the dark that may or may not succeed, and would significantly extend the runtime of this task.

If you run into specific issues with known specific fixes, I will happily code them into the script.

Can we not set a flag so it checks once at the start of the task, then never again?

For example, detect runs the SFC verify and DISM scan, kicks off dependent prerequisite tasks to run DISM and SFC, then returns to Windows Update for the set?

True, but you could technically use a registry entry to have a count of how many times the session has run. Like “HKLM:\Software\ImmyBot\Flags” with an integer value called “UpdatesCount” or something. If it’s on the first run, then run sfc/dism, if not then skip it and continue. Zero the count when the session is done.
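The counter logic itself is tiny. Here is a sketch in Python, with a plain dictionary standing in for the registry key (an assumption made so the example runs anywhere; the real thing would read and write the integer value under the key suggested above).

```python
from typing import MutableMapping


def should_run_health_check(
    flags: MutableMapping[str, int], name: str = "UpdatesCount"
) -> bool:
    """Return True only on the first call of a session, and bump the count."""
    count = flags.get(name, 0)
    flags[name] = count + 1
    return count == 0


def end_session(flags: MutableMapping[str, int], name: str = "UpdatesCount") -> None:
    """Zero the count so the next session runs the health check again."""
    flags[name] = 0
```

One thing to watch: if a session dies before `end_session` runs, the count never resets, so a stale value would need some expiry or cleanup.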

The problem I see with chasing explicit error codes (especially with Microsoft error codes) is that eventually you hit the generic errors which basically just mean “it’s broken :man_shrugging:”. At that point, how do you dive into whether or not those kinds of errors should or should not run sfc/dism? Also, if system files are unhealthy/corrupted, should updates be allowed to continue? (especially in the case of feature updates that require rebooting to the Safe OS phase)

That being said, IF things are more specific, then maybe it’s possible to have a built-in troubleshooter? If error code XXXX is present, run SFC/DISM. If error code XOXO is present, clear the SoftwareDistribution folder. If error code XXOO is present, write a custom warning, “The device does not support the update to Windows 11”, to the console.
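That troubleshooter is essentially a lookup table. A sketch in Python, keeping the placeholder codes from the post as keys (they are not real error codes) and using made-up action names:

```python
# Placeholder codes from the post mapped to hypothetical action names.
# Real entries would use actual Windows Update error codes with known fixes.
REMEDIATIONS = {
    "XXXX": "run_sfc_and_dism",
    "XOXO": "clear_software_distribution",
    "XXOO": "warn_device_does_not_support_windows_11",
}


def pick_remediation(error_code: str) -> str:
    """Return the known fix for a code, or hand off when the code is unknown."""
    return REMEDIATIONS.get(error_code.strip().upper(), "escalate_to_technician")
```

Unknown or generic codes fall through to a person rather than to a blind SFC/DISM run, which keeps the table limited to errors with a known fix.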

FYI, already looking into some error codes. Just want to know if my concerns are baseless haha

Why do this?

Write a task that just runs sfc and/or dism once per session—“done.”

I guess that works too. I just figured, considering how finicky updates can be, it might be a good pre-check. Kind of like putting on your seat belt before you start driving :man_shrugging:

I’ll start a task, and if it works well I’ll post it in the community later

Is there a reason we can’t include common scenarios, like a DISM check, in the deployment?

As you have said, there are thousands of potential errors, and it is pointless to write an exception for each one. What we are saying is that chkdsk, dism and sfc /scannow are the most basic tools to run if any exception is thrown.

A specific issue would be if a machine is reporting network latency, or if we see a HDD exists, frankly any sudden slowdown in the execution.

I just don’t see how sfc, dism, etc would not be beneficial to enhancing this script.

What common scenarios are you referring to?

High Disk Usage.
Check if the computer has an HDD - put a warning that this task is likely to fail and/or take a lot longer because of the HDD.
Check if anti-virus (like SentinelOne) is competing for resource utilization with Microsoft.
Check if any process is competing for 100% resource utilization after being sustained for >5 minutes.
High CPU Usage.
See above.
ThreatLocker (cdn.immy.bot not excluded). This one might not be possible since their API isn’t public.
Check if the computer has an old i* processor.

We’ve seen that HDDs and i3/gen 4 i5s cause bottlenecks.

The anti-virus check could be applied to any of the most popular products.
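To make a few of the checks above concrete, here is a sketch in Python (illustration only). It assumes the facts have already been collected elsewhere: `DeviceFacts`, the one-sample-per-minute format, and the 95% threshold are all hypothetical choices, not values from the existing task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeviceFacts:
    has_hdd: bool
    # Usage percentages, one sample per minute, oldest first.
    cpu_percent: list[float] = field(default_factory=list)
    disk_percent: list[float] = field(default_factory=list)


def sustained_high(samples: list[float], threshold: float = 95.0, minutes: int = 5) -> bool:
    """True when the last `minutes` samples are all at or above `threshold`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(s >= threshold for s in recent)


def precheck_warnings(facts: DeviceFacts) -> list[str]:
    """Collect warnings to show before the update deployment starts."""
    warnings: list[str] = []
    if facts.has_hdd:
        warnings.append("HDD detected: this task is likely to take longer or fail")
    if sustained_high(facts.cpu_percent):
        warnings.append("CPU usage has stayed high for the last 5 minutes")
    if sustained_high(facts.disk_percent):
        warnings.append("Disk usage has stayed high for the last 5 minutes")
    return warnings
```

An empty list means the pre-checks passed; anything else could be surfaced as a notice, or used to stop the deployment.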

Hope that helps.