AMDGPU Support

Tue May 10 01:04:47 UTC 2022

On Mon, 2 May 2022 16:56:20 +0100
Morris Zuss <morris at vlen.org> wrote:
> > *ERROR* sw_init of IP block <gfx_v8_0> failed -2
> 
> One thing I did notice is that the amdgpu is being loaded now
> according to lspci -k. Although I cannot seem to unload the amdgpu
> module and when I try to shutdown it just hangs indefinitely unless I
> force it off.
Do you see anything on the display?

> After compiling again, the dmesg output was pretty much the same: 
> https://termbin.com/x5kx

> amdgpu 0000:07:00.0: amdgpu: gfx8: 
> Failed to load firmware "/*(DEBLOBBED)*/"

> [drm:gfx_v8_0_sw_init.cold [amdgpu]] 
> *ERROR* Failed to load gfx firmware!

These two prints are due to:
> r = gfx_v8_0_init_microcode(adev);
> if (r) {
>     DRM_ERROR("Failed to load gfx firmware!\n");
>     return r;
> }
in gfx_v8_0_sw_init in amdgpu/gfx_v8_0.c.

Here it seems that the return r wasn't patched out correctly for some
reason. If you modified the Parabola kernel source, can you look at
gfx_v8_0.c to verify that the return r has really been replaced by a
comment?
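
If you'd rather not check that by eye, here is a rough sketch of a
checker. It is only a heuristic I made up for illustration: the function
name gfx_fw_return_is_active is hypothetical, and it assumes the
"return r;" sits on the line right after the DRM_ERROR print, as in the
excerpt above. It takes the source text as a string and tells whether an
uncommented "return r;" is still there.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the kernel or of any deblob script):
 * returns true if the line following the "Failed to load gfx firmware!"
 * print still contains a "return r;" that is not preceded by a comment
 * opener on that same line.
 */
bool gfx_fw_return_is_active(const char *src)
{
    const char *msg = strstr(src, "Failed to load gfx firmware!");
    if (!msg)
        return false;               /* message not found at all */

    const char *line = strchr(msg, '\n');
    if (!line)
        return false;               /* nothing after the print */
    line++;                         /* first character of the next line */

    const char *eol = strchr(line, '\n');
    size_t len = eol ? (size_t)(eol - line) : strlen(line);

    /* Search for "return r;" within that single line. */
    const char *needle = "return r;";
    size_t nlen = strlen(needle);
    const char *ret = NULL;
    for (size_t i = 0; i + nlen <= len; i++) {
        if (strncmp(line + i, needle, nlen) == 0) {
            ret = line + i;
            break;
        }
    }
    if (!ret)
        return false;               /* the return was removed */

    /* A comment opener before it on the same line means it is disabled. */
    for (const char *p = line; p + 1 < ret; p++) {
        if (p[0] == '/' && (p[1] == '*' || p[1] == '/'))
            return false;
    }
    return true;
}
```

A true result would mean the early return is still active in the source
you looked at. It doesn't handle a comment that starts on an earlier
line, so reading the file remains the reliable check.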

Several things seem to indicate that:
- The -2 looks like a failed firmware load since -2 is -ENOENT (No
  such file or directory).
- If another firmware load had failed, it would have printed something
  after the "Failed to load gfx firmware!" and before the
  "*ERROR* sw_init of IP block <gfx_v8_0> failed -2"
- I found no way the code could do this return -2 if it didn't fail
  there, though there was a lot of code to read so I could have missed
  it as well[1].
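
As a quick way to double check what such a number means, here is a
small userspace sketch. The helper names kernel_ret_name and
describe_kernel_ret are made up for this mail; the only real things
used are the errno constants and strerror() from the C library. It
relies on the kernel convention that functions return the negated errno
value on failure.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: symbolic name for the kernel-style return values
 * discussed in this thread. Kernel functions return -errno on failure.
 */
const char *kernel_ret_name(int r)
{
    if (r >= 0)
        return "not an error code";

    switch (-r) {
    case ENOENT:
        return "ENOENT";
    case EINVAL:
        return "EINVAL";
    default:
        return "unknown";
    }
}

/* Print the value, its symbolic name and the C library's description. */
void describe_kernel_ret(int r)
{
    if (r >= 0) {
        printf("%d: %s\n", r, kernel_ret_name(r));
        return;
    }
    printf("%d = -%s (%s)\n", r, kernel_ret_name(r), strerror(-r));
}
```

Calling describe_kernel_ret(-2) and describe_kernel_ret(-22) covers the
two values that show up in this analysis.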

Another issue could be that you are not running the kernel that
corresponds to the source you patched.

That has happened to me very often, and it led to a lot of time spent
on impossible debugging sessions, so I try to find ways to match the
binary being built with the log from the running kernel.

A way to do that in your case would be to look at when the kernel image
was last created (with ls -l) and compare that with the time at which
the kernel used to produce the logs was built, which we can see with
dmesg, at the beginning of the log. Here's an example:
> [    0.000000] Linux version 5.15.12-gnu-1-pae
> (linux-libre-pae at parabola) (gcc (GCC) 11.1.0, GNU ld (GNU Binutils)
> 2.36.1) #1 SMP PREEMPT Mon, 10 Jan 2022 01:47:28 +0000
Here that kernel was apparently built on 10 January 2022.
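
If you want to pull that date out of the log automatically, something
like the following sketch could do it. The struct build_time and the
function parse_build_time are names I invented, and the code assumes the
banner ends with a date in the same "Mon, 10 Jan 2022 01:47:28 +0000"
layout as in the example above; other kernels may print the date in a
different layout, in which case it simply returns -1.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the build date found in the banner. */
struct build_time {
    int year, month, day, hour, min, sec;
};

/*
 * Parse the build date from a "Linux version ..." line.
 * Assumes the date follows the "#N ..." build counter and looks like
 * "Www, DD Mon YYYY HH:MM:SS". Returns 0 on success, -1 otherwise.
 */
int parse_build_time(const char *banner, struct build_time *out)
{
    static const char *months[12] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    char mon[4];

    /* The build counter ("#1") comes before the date. */
    const char *hash = strchr(banner, '#');
    if (!hash)
        return -1;

    /* The first comma after it is the one following the weekday. */
    const char *comma = strchr(hash, ',');
    if (!comma)
        return -1;

    if (sscanf(comma + 1, "%d %3s %d %d:%d:%d",
               &out->day, mon, &out->year,
               &out->hour, &out->min, &out->sec) != 6)
        return -1;

    /* Translate the month abbreviation into a number from 1 to 12. */
    for (int i = 0; i < 12; i++) {
        if (strcmp(mon, months[i]) == 0) {
            out->month = i + 1;
            return 0;
        }
    }
    return -1;
}
```

The resulting date can then be compared by hand with the modification
time that ls -l shows for the kernel image you built.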

Note that GNU/Linux distributions tend to disable that feature as having
the build time in the binaries prevents reproducible builds, but in our
case here it is very useful, and it seems that Parabola keeps such
information, so we should be good here.

References:
-----------
[1] For the record, here is a more detailed analysis of the code that
    shows that -2 probably comes from gfx_v8_0_init_microcode:

   That print:
   > [drm:amdgpu_device_init.cold [amdgpu]]
   > *ERROR* sw_init of IP block <gfx_v8_0> failed -2

   comes from that code:
   > r = adev->ip_blocks[i].version->funcs->sw_init((void *)adev);
   > if (r) {
   >     DRM_ERROR("sw_init of IP block <%s> failed %d\n",
   >                adev->ip_blocks[i].version->funcs->name, r);
   >     goto init_failed;
   > }

   And that .sw_init function is gfx_v8_0_sw_init (it's similar to the
   .hw_init function we had before).

   So it means that, if your patch worked, the failure would have had
   to happen after the firmware loading and before the end of the
   gfx_v8_0_sw_init function, without printing anything in between.
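
   To make the two scenarios concrete, here is a userspace mock of that
   control flow. Everything prefixed with fake_ is a stand-in I wrote
   for illustration, not kernel code: it assumes the firmware file is
   missing and that every later step succeeds, and the `patched`
   parameter models a build where the return r after the firmware
   failure was removed.

```c
#include <errno.h>
#include <stdio.h>

/* Stand-in for gfx_v8_0_init_microcode(): the firmware file is missing. */
static int fake_init_microcode(void)
{
    return -ENOENT;
}

/* Stand-in for the later steps (ring init, KIQ, MQD, ...): all succeed. */
static int fake_later_steps(void)
{
    return 0;
}

/*
 * Stand-in for gfx_v8_0_sw_init(). With patched == 0 the firmware
 * failure is returned to the caller; with patched != 0 the message is
 * still printed but the function carries on.
 */
int fake_sw_init(int patched)
{
    int r = fake_init_microcode();

    if (r) {
        fprintf(stderr, "Failed to load gfx firmware!\n");
        if (!patched)
            return r;
    }
    return fake_later_steps();
}

/* Stand-in for the caller, which reports any sw_init failure. */
int fake_device_init(int patched)
{
    int r = fake_sw_init(patched);

    if (r)
        fprintf(stderr, "sw_init of IP block <gfx_v8_0> failed %d\n", r);
    return r;
}
```

   With fake_device_init(0) you get the same pair of messages as in
   your dmesg, while fake_device_init(1) only prints the firmware
   message and then succeeds. That is why seeing both messages suggests
   the unpatched path is the one that ran.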

   If we remove the parts that print right before returning an error,
   we're left with the following code:
   > r = amdgpu_ring_init(adev, ring, 1024, &adev->gfx.eop_irq,
   >                      AMDGPU_CP_IRQ_GFX_ME0_PIPE0_EOP,
   >                      AMDGPU_RING_PRIO_DEFAULT);
   > if (r)
   >     return r;
   This either prints or doesn't return -2 / -ENOENT (No such file or
   directory), so we can rule that part out.

   > r = gfx_v8_0_compute_ring_init(adev, ring_id, i, k, j);
   > if (r)
   >     return r;
   This can't be the source either, as it only returns errors from
   amdgpu_ring_init, which doesn't return -2 without printing.

   > r = amdgpu_gfx_kiq_init_ring(adev, &kiq->ring, &kiq->irq);
   > if (r)
   >     return r;
   This should only return 0 or -22 / -EINVAL or print errors before
   returning, so we can rule that out too.

   > r = amdgpu_gfx_mqd_sw_init(adev, sizeof(struct vi_mqd_allocation));
   > if (r)
   >     return r;
   This should either return 0 or print before returning errors, so it's
   not that either.

   And with:
   > r = gfx_v8_0_gpu_early_init(adev);
   > if (r)
   >     return r;

   We only have this code that is interesting:
   > case CHIP_POLARIS11:
   > case CHIP_POLARIS12:
   >     ret = amdgpu_atombios_get_gfx_info(adev);
   >     if (ret)
   >         return ret;
   >     [...]
   >     break;
   And that can only return -22 (-EINVAL) or 0.

Denis.