Issue with 6.1.8

Alexandre Oliva lxoliva at fsfla.org
Sun Mar 12 16:38:41 UTC 2023


On Mar 12, 2023, Alexandre Oliva <lxoliva at fsfla.org> wrote:

> 0000:00:02.0: Missing Free firmware (non-Free firmware loading is disabled)
> [30 seconds elapse]
> Out of memory: Killed process 333 (udevd) [...]
> udevd[320]: worked [333] failed while handling '/devices/pci0000:00/0000:00:02.0'
> Out of memory: Killed process 470 (modprobe) [...]
> i915 0000:00:02.0: GuC firmware /*(DEBLOBBED)*/: fetch failed with error -12

I have a theory as to what the problem might be.  Whether this is what's
hitting you or not, I can't be sure, but it is likely a problem.

drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c has a new loop in
intel_uc_fw_fetch in 6.0, that enables multiple firmware versions to be
tried.  Formerly, it tried a preferred and a fallback, and that was all.
In the new version, it may loop indefinitely, given certain unexpected
circumstances, and it looks like deblobbing of the firmware names
creates the unexpected circumstances that make it loop indefinitely.

Now, I can tell the GuC firmware loading loop completes, because of the
last quoted line above, but the same function is used to load HuC
firmware, and it may also loop indefinitely for HuC.

If my theory is correct, setting the i915.huc_firmware_path param to
random garbage, say /does/not/exist, will prevent it from entering the
loop, avoiding the problem.  If I'm missing some other piece of the
puzzle, perhaps setting i915.guc_firmware_path would help, despite the
above; maybe the error above comes up in a retry or something.


Now, here's my theory of why we're looping forever when
intel_uc_fw_fetch is asked to load HuC:

firmware_reject_nowarn fails for the initially-selected firmware name.
If the param is not set, that's not a final fail, and so we enter the
loop:

        err = firmware_reject_nowarn(&fw, uc_fw->file_selected.path, dev);
[...]
        /* Any error is terminal if overriding. Don't bother searching for older versions */
        if (err && intel_uc_fw_is_overridden(uc_fw))
                goto fail;
[...]
        while (err == -ENOENT) {
[...]
                __uc_fw_auto_select(i915, uc_fw);
[...]
                err = firmware_reject_nowarn(&fw, uc_fw->file_selected.path, dev);

Now, how does __uc_fw_auto_select choose the next firmware name to try?
It goes through the GuC or HuC array, skipping entries that don't match
the current platform.  When it finds one, if file_selected.path is NULL,
that's the one it's going to use next (or first), but if it's not NULL,
then it checks whether it's the same string we used last time.  If it's
not, then it also skips it, but if it is the same string, then it knows
it found the point of the last try in the array, so it resets
file_selected.path so that the next entry found is used.

                if (uc_fw->file_selected.path) {
                        if (uc_fw->file_selected.path == blob->path)
                                uc_fw->file_selected.path = NULL;

                        continue;
                }

                uc_fw->file_selected.path = blob->path;

Now, the HuC array is initialized from:

#define INTEL_HUC_FIRMWARE_DEFS(fw_def, huc_raw, huc_mmp) \
       fw_def(ALDERLAKE_P,  0, huc_raw(tgl)) \
       fw_def(ALDERLAKE_P,  0, huc_mmp(tgl,  7, 9, 3)) \
       fw_def(ALDERLAKE_S,  0, huc_raw(tgl)) \
       fw_def(ALDERLAKE_S,  0, huc_mmp(tgl,  7, 9, 3)) \
[...]

that builds up a string literal for each blob from the platform, the
firmware kind (huc or guc), and the version numbers.

When the strings are all different, the '== blob->path' test above will
only succeed if we have indeed found the last entry we went through.

But when all strings are "/*(DEBLOBBED)*/", it's not only the strings
that compare equal, the *pointers* to them (that's what blob->path
holds) may also compare equal as the compiler (or even the linker)
unifies identical strings.
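
To illustrate the difference between the two kinds of equality, here is
a minimal stand-alone sketch.  The names same_pointer, same_text,
entry_a and entry_b are made up for this example; none of them come
from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Identity test, as in the '== blob->path' check: true only when both
 * pointers hold the same address. */
static bool same_pointer(const char *a, const char *b)
{
        return a == b;
}

/* Content test: true when both strings hold the same characters. */
static bool same_text(const char *a, const char *b)
{
        assert(a && b);
        return strcmp(a, b) == 0;
}

/* Two table-like entries built from identical literals.  Whether they
 * end up at the same address is left to the compiler and the linker. */
static const char *const entry_a = "/*(DEBLOBBED)*/";
static const char *const entry_b = "/*(DEBLOBBED)*/";
```

same_text(entry_a, entry_b) always holds, whereas the result of
same_pointer(entry_a, entry_b) is not guaranteed either way: it depends
on whether the toolchain merged the two literals.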

So, we find the first applicable entry, try it, fail, scan the array for
it, find it, keep looking, find the next entry, select it, and try
again.  That also fails, so we try to look for yet another fallback
entry, but because the string pointers are identical, the loop thinks
the last one that was tried was the first one, so it suggests the second
one again, and again, and again...  indefinitely.
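
To make that failure mode concrete, here is a small stand-alone model of
the selection step and of the retry loop.  It is only a sketch under
simplifying assumptions: struct blob, auto_select and count_attempts are
names made up for this example, the platform and revision filtering is
left out, every load attempt is assumed to fail, and a retry limit
stands in for "indefinitely".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the firmware table. */
struct blob {
        const char *path;
};

/* Models the selection step: skip entries until the previously selected
 * path is seen (compared by pointer), then pick the next entry.
 * Returns the index picked, or -1 when no candidate is left. */
static int auto_select(const struct blob *blobs, size_t n,
                       const char **selected)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (*selected) {
                        if (*selected == blobs[i].path)
                                *selected = NULL;
                        continue;
                }
                *selected = blobs[i].path;
                return (int)i;
        }
        return -1;
}

/* Models the retry loop, treating every attempt as a failed load.
 * Returns how many attempts were made, capped at limit. */
static int count_attempts(const struct blob *blobs, size_t n, int limit)
{
        const char *selected = NULL;
        int attempts = 0;

        assert(limit >= 0);
        while (attempts < limit) {
                if (auto_select(blobs, n, &selected) < 0)
                        break;          /* ran out of candidates */
                attempts++;             /* pretend the load failed */
        }
        return attempts;
}
```

With three entries whose paths are distinct pointers, count_attempts
stops after 3 attempts.  With three entries that all share a single
pointer, auto_select keeps returning index 1 after the first call, and
count_attempts only stops when it reaches the limit.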

I suppose something in the loop is allocating memory and that's why it
runs out of memory after gazillions of retries.


Now, why should the workaround of setting the parameter work?
That's because, when the parameter is set, intel_uc_fw_is_overridden
returns true, so we skip the loop entirely.


Now, I'd appreciate it if either (or both) of you could confirm whether
the workaround does indeed work around the issue.  Meanwhile, I'll try
to find a way to change the code so that it doesn't fail us, that can
fit in our cleaning-up scripts.
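
One conceivable direction, offered purely as an illustration and not as
the change that will actually be made, is to remember a position in the
table rather than compare path pointers, so that identical strings can
no longer confuse the search.  struct blob, struct selection and
select_next are hypothetical names for this sketch, and the platform
and revision filtering is again left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the firmware table. */
struct blob {
        const char *path;
};

/* Hypothetical selection state: a table position instead of a pointer
 * comparison against the last path tried. */
struct selection {
        size_t next;            /* index of the next entry to try */
        const char *path;       /* path currently selected, or NULL */
};

/* Picks the next entry by position.  Returns the index picked, or -1
 * once the table is exhausted, regardless of what the paths contain. */
static int select_next(const struct blob *blobs, size_t n,
                       struct selection *sel)
{
        size_t picked;

        assert(blobs && sel);
        if (sel->next >= n) {
                sel->path = NULL;
                return -1;
        }
        picked = sel->next++;
        sel->path = blobs[picked].path;
        return (int)picked;
}
```

Starting from a zeroed struct selection, successive calls over a
three-entry table return 0, 1, 2 and then -1, even when all three paths
are the very same pointer.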

Thanks again for the reports!

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>

