AMDGPU Support

Tue May 10 14:48:04 UTC 2022

On 10/05/2022 02:04, Denis 'GNUtoo' Carikli wrote:

> Do you see something on the display?

When I power off, it just stays on a black screen. This is probably
related to the console freezing when I try to enter my login details
at startup. When I switch to a different tty, I get a black screen but
can still log in and run startx by guessing where I am. It still says
module amdgpu is in use when I try to "rmmod amdgpu". The issue is
still the same after everything I've done below.

> That happened to me very often and it lead to a lot of time spent
> on impossible debug sessions so I try to find ways to match the binary
> being built and the log from the running kernel.

It appears I'm in the same situation. I did check the date/time of the
kernel against my build directory and they matched. Regardless, I'm
making stupid mistakes somewhere along the way.
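
One way to tie a dmesg log to a specific build, sketched here as an
assumption rather than something that was done in this thread, is to
record the running kernel's release and version strings (the same
fields that "uname -r" and "uname -v" print; the version string
normally carries the build timestamp). The helper name
running_kernel_id is made up for illustration.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: write "<release> <version>" of the running
 * kernel into buf. Returns 0 on success, -1 if uname(2) fails or
 * buf is too small to hold the result. */
int running_kernel_id(char *buf, size_t len)
{
    struct utsname u;
    int n;

    if (uname(&u) != 0)
        return -1;

    n = snprintf(buf, len, "%s %s", u.release, u.version);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Printing that string next to each saved dmesg output makes it easier
to tell later which build a given log came from.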

At least this time around it didn't stop at gfx_v8_0_sw_init. I 
recompiled the kernel from where we last left off and it gave a new 
error relating to sdma_v3_0.

Dmesg output after recompiling from where we last left off: 
https://termbin.com/69jk

> [   15.176279] [drm:sdma_v3_0_sw_init.cold [amdgpu]] *ERROR* Failed to load sdma firmware!
> [   15.176340] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <sdma_v3_0> failed -2

That brought me to sdma_v3_0.c:1141 and this code block:

>         r = sdma_v3_0_init_microcode(adev);
>         if (r) {
>                 DRM_ERROR("Failed to load sdma firmware!\n");
>                 return r;
>         }

I then modified the deblob script to remove that return statement.
After deblobbing and compiling everything, that got us a bit further
again.
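
For reference, the "failed -2" in these sw_init lines is a negated
errno value. On Linux, errno 2 is ENOENT ("No such file or
directory" with glibc), which is what one would expect when the
firmware file cannot be found. The small helper below is only a
userspace sketch for decoding such codes; kernel_err_text is a
made-up name.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn a kernel-style negative return code
 * (for example -2) into the matching errno description. */
const char *kernel_err_text(int ret)
{
    return strerror(ret < 0 ? -ret : ret);
}
```

Calling kernel_err_text(-2) gives the same text as strerror(ENOENT).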

Dmesg output after the sdma changes: https://termbin.com/94t7

> [   27.847639] amdgpu 0000:07:00.0: amdgpu: amdgpu_uvd: Can't load firmware "/*(DEBLOBBED)*/"
> [   27.847698] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <uvd_v6_0> failed -2

The error message there leads me to amdgpu_uvd.c:218.

>         r = request_firmware(&adev->uvd.fw, fw_name, adev->dev);
>         if (r) {
>                 dev_err(adev->dev, "amdgpu_uvd: Can't load firmware \"%s\"\n",
>                         fw_name);
>                 return r;
>         }

I modified the script to remove the return statement and compiled once more.

Dmesg output after the uvd changes: https://termbin.com/ek3z

> [   15.110568] amdgpu 0000:07:00.0: amdgpu: amdgpu_uvd: Can't load firmware "/*(DEBLOBBED)*/"
> [   15.110572] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [   15.110576] #PF: supervisor read access in kernel mode
> [   15.110578] #PF: error_code(0x0000) - not-present page

It stops abruptly there.
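
A possible reading of the faulting address, offered as a guess since
the call trace is not quoted here: a read at 0000000000000008 is what
you get when code reads a member located 8 bytes into a structure
through a NULL pointer. If the firmware pointer is still NULL after
the failed request and later code reads its data member, that would
fit. The struct below is a simplified stand-in, assuming the layout
is a size_t followed by a data pointer; fake_firmware and the
function name are invented for this sketch.

```c
#include <stddef.h>

/* Simplified stand-in for a firmware descriptor (assumed layout:
 * a size field followed by a pointer to the firmware bytes). */
struct fake_firmware {
    size_t size;
    const unsigned char *data;
};

/* Address the CPU would touch when reading fw->data with fw == NULL:
 * it is simply the byte offset of the data member. */
size_t null_deref_address_of_data(void)
{
    return offsetof(struct fake_firmware, data);
}
```

On a 64-bit x86 build this offset is 8, which lines up with the
address in the BUG line.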

I couldn't find any new errors to go on and I'm not sure how to deal
with the NULL pointer dereference, so I just followed the call trace. I
removed the return statement at uvd_v6_0.c:406 just to see if it did
anything. The dmesg output stayed the same.

>         r = amdgpu_uvd_sw_init(adev);
>         if (r)
>                 /*(DEBLOBBED)*/
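
One plausible explanation for the unchanged output, assuming the oops
is raised inside amdgpu_uvd_sw_init itself (the truncated log does
not confirm this): if the inner function faults before it returns,
what the caller does with the return value can no longer matter. The
userspace mock below only illustrates that control flow. Every name
in it is hypothetical, and a flag stands in for the real fault.

```c
#include <stdbool.h>
#include <stddef.h>

struct mock_fw { size_t size; const unsigned char *data; };

static bool would_fault; /* set where a real kernel would oops */

/* Stand-in for a firmware request that always fails with -2. */
static int mock_request_firmware(const struct mock_fw **fw)
{
    *fw = NULL;
    return -2;
}

/* Inner init with its early "return r;" removed. */
int mock_inner_sw_init(void)
{
    const struct mock_fw *fw;
    int r = mock_request_firmware(&fw);

    /* The early return used to sit here. Without it, execution
     * reaches code that expects a valid firmware pointer. */
    if (fw == NULL)
        would_fault = true; /* real code would read fw->data here */
    return r;
}

/* Outer init: ignoring or propagating r changes nothing about the
 * fault, which has already been recorded by the inner call. */
int mock_outer_sw_init(bool ignore_error)
{
    int r = mock_inner_sw_init();
    return ignore_error ? 0 : r;
}

bool fault_recorded(void)
{
    return would_fault;
}
```

In this mock, the fault flag is set on both paths, whether the outer
function ignores the error or passes it on.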

Dmesg output after the uvd_v6 change: https://termbin.com/xw5w

This is what the deblob script currently looks like (minus the 
uvd_v6_0.c changes): https://termbin.com/voa4

Another output of lspci: https://termbin.com/0cvy