AMDGPU Support

Denis 'GNUtoo' Carikli GNUtoo at cyberdimension.org
Mon May 2 14:26:39 UTC 2022


On Fri, 29 Apr 2022 22:21:35 +0100
Morris Zuss <morris at vlen.org> wrote:
> But regardless, here it is: https://termbin.com/5cbm  
Thanks a lot.

Here's the interesting part in it:
> [drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
> 0000:07:00.0: Missing Free firmware (non-Free firmware loading is disabled)
> amdgpu: mc: Failed to load firmware "/*(DEBLOBBED)*/"
> [drm:gmc_v8_0_sw_init.cold [amdgpu]] *ERROR* Failed to load mc firmware!

The log above is expected: even with the "return r;" right below
removed, the driver still prints the failures mentioned above.
> r = gmc_v8_0_init_microcode(adev);
> if (r) {
>     DRM_ERROR("Failed to load mc firmware!\n");
>     return r;
> }  
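
To make the effect of that change concrete, here is a minimal
user-space sketch (not kernel code) of the pattern: the error is still
logged, but it is no longer propagated. The names load_mc_firmware_stub
and sw_init_patched are made up for this illustration, and the stub
simply pretends that the firmware is missing (-ENOENT is an arbitrary
choice).

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for gmc_v8_0_init_microcode(): it always
 * fails, as if the firmware were missing. */
static int load_mc_firmware_stub(void)
{
    return -ENOENT;
}

/* Same shape as the quoted snippet, with "return r;" removed. */
static int sw_init_patched(void)
{
    int r;

    r = load_mc_firmware_stub();
    if (r) {
        /* The message is still printed... */
        fprintf(stderr, "Failed to load mc firmware!\n");
        /* ...but without "return r;" the init sequence goes on. */
    }

    return 0;
}
```

With this shape the caller sees 0 even though the error line shows up
in the log, which matches what we observe in dmesg.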

Then we can see it continues here, which is what we want:
> amdgpu 0000:07:00.0: amdgpu: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
> amdgpu 0000:07:00.0: amdgpu: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
> [drm] Detected VRAM RAM=4096M, BAR=256M
> [drm] RAM width 128bits GDDR5
> [TTM] Zone  kernel: Available graphics memory: 16398342 KiB
> [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
> [TTM] Initializing pool allocator
> [TTM] Initializing DMA pool allocator
> [drm] amdgpu: 4096M of VRAM memory ready
> [drm] amdgpu: 4096M of GTT memory ready.
> [drm] GART: num cpu pages 65536, num gpu pages 65536  

That GART message is printed by the amdgpu_gart_init function, which is
called from gmc_v8_0_gart_init, so it at least proceeds fine up to this
part:
> r = gmc_v8_0_gart_init(adev);
> if (r)
>     return r;  
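
As a sanity check, the numbers in that part of the log are consistent
with each other. The small sketch below recomputes them. The helper
names are made up, and the 4 KiB page size is my assumption; it is not
stated in the log.

```c
#include <stdint.h>

/* Size in bytes of an inclusive address range [start, end]. */
static uint64_t range_size(uint64_t start, uint64_t end)
{
    return end - start + 1;
}

/* Number of pages covering size bytes; size is assumed to be a
 * multiple of page_size. */
static uint64_t num_pages(uint64_t size, uint64_t page_size)
{
    return size / page_size;
}
```

For the GART range 0x000000FF00000000 - 0x000000FF0FFFFFFF this gives
256 MiB, and 256 MiB divided into 4 KiB pages gives the 65536 pages
reported by the "GART: num cpu pages" line. The VRAM range gives
4096 MiB the same way.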

And then it fails again because of a missing firmware.
The -22 value corresponds to -EINVAL in the code.
> [drm:gmc_v8_0_hw_init [amdgpu]] *ERROR* Failed to load MC firmware!
> [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init 1 failed -22
> amdgpu 0000:07:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:07:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:07:00.0: amdgpu: amdgpu: finishing device.  
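
If you want to double check what a numeric value like -22 means, a tiny
user-space helper can do it. This is only a sketch: the function name is
made up, and it relies on the user-space errno values matching the
kernel ones, which as far as I know is the case on Linux for the basic
error codes such as EINVAL.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Format a kernel-style negative return value (like -22) together
 * with its textual description. Returns what snprintf returns. */
static int describe_kernel_error(int r, char *buf, size_t len)
{
    return snprintf(buf, len, "%d (%s)", r, strerror(-r));
}
```

Calling it with -22 and printing the buffer shows the description that
the C library associates with EINVAL.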

> Is there anything else I should be looking for besides the "r = ..." 
> statements? The dmesg output has brought me to amdgpu_kms.c and 
> amdgpu_device.c but I couldn't find any relevant statements there.  
The main idea is to start from the first error message, and from there
to find the code to patch. The main issue is that when the firmware
loading fails, the driver treats that as an error. If we don't return
the error after failing to load the firmware, the driver continues.

So at the next boot with that "return r;" patched, it will still print
the same error but continue, so we can ignore the previous error and
try to find a new one.

And here you seem to have found the right place to patch:
> static int gmc_v8_0_hw_init(void *handle)
> {
>     int r;
>     struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> 
>     gmc_v8_0_init_golden_registers(adev);
> 
>     gmc_v8_0_mc_program(adev);
> 
>     if (adev->asic_type == CHIP_TONGA) {
>         r = gmc_v8_0_tonga_mc_load_microcode(adev);
>         if (r) {
>             DRM_ERROR("Failed to load MC firmware!\n");
>             return r;
>         }
>     } else if (adev->asic_type == CHIP_POLARIS11 ||
>             adev->asic_type == CHIP_POLARIS10 ||
>             adev->asic_type == CHIP_POLARIS12) {
>         r = gmc_v8_0_polaris_mc_load_microcode(adev);
>         if (r) {
>             DRM_ERROR("Failed to load MC firmware!\n");
>             return r;
>         }
>     }
> 
>     r = gmc_v8_0_gart_enable(adev);
>     if (r)
>         return r;
> 
>     return r;
> }  

If this has been patched, it should still print that "Failed to load MC
firmware!" message, so we can suppose the patch worked for now.

So we then need to understand why it fails with -22/-EINVAL here:
> [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init 1 failed -22  

In amdgpu_device.c we have: "DRM_ERROR("hw_init %d failed %d\n", i,
r);", so it fails there.

If we look at the code around it:
> /* need to do gmc hw init early so we can allocate gpu mem */
> if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_GMC) {  
[...]
>     r = adev->ip_blocks[i].version->funcs->hw_init((void *)adev);
>     if (r) {
>         DRM_ERROR("hw_init %d failed %d\n", i, r);
>         goto init_failed;
>     }  
[...]
> }  

So there is a hw_init function somewhere, and it's reached through some
struct that has ".type = AMD_IP_BLOCK_TYPE_GMC". Since we're dealing
with some gmc_v8 hardware here, we can try to find it in gmc_v8_0.c as
it is probably there.

And in that file we have:
> const struct amdgpu_ip_block_version gmc_v8_0_ip_block =
> {
>     .type = AMD_IP_BLOCK_TYPE_GMC,
>     .major = 8,
>     .minor = 0,
>     .rev = 0,
>     .funcs = &gmc_v8_0_ip_funcs,
> };
> 
> const struct amdgpu_ip_block_version gmc_v8_1_ip_block =
> {
>     .type = AMD_IP_BLOCK_TYPE_GMC,
>     .major = 8,
>     .minor = 1,
>     .rev = 0,
>     .funcs = &gmc_v8_0_ip_funcs,
> };
> 
> const struct amdgpu_ip_block_version gmc_v8_5_ip_block =
> {
>     .type = AMD_IP_BLOCK_TYPE_GMC,
>     .major = 8,
>     .minor = 5,
>     .rev = 0,
>     .funcs = &gmc_v8_0_ip_funcs,
> };  

All three have AMD_IP_BLOCK_TYPE_GMC, so any of them could match, and
all three have ".funcs = &gmc_v8_0_ip_funcs,". So that is what we are
looking for.

And in the same file we have:
> static const struct amd_ip_funcs gmc_v8_0_ip_funcs = {
> [...]
>     .hw_init = gmc_v8_0_hw_init,
> [...]
> };  

So now we know that gmc_v8_0_hw_init fails for some reason.
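
To illustrate how the generic code ends up in the gmc_v8 function, here
is a simplified user-space model of that dispatch. All the mock_* names
are made up, the structures are heavily reduced compared to the real
ones, and the GMC block is put at index 1 only to mirror the "hw_init 1
failed -22" line from the log.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical versions of the kernel structures. */
enum mock_block_type { MOCK_TYPE_COMMON, MOCK_TYPE_GMC };

struct mock_ip_funcs {
    int (*hw_init)(void *handle);
};

struct mock_ip_block_version {
    enum mock_block_type type;
    const struct mock_ip_funcs *funcs;
};

/* Stand-in for gmc_v8_0_hw_init(): fails with -EINVAL as in the log. */
static int mock_gmc_hw_init(void *handle)
{
    (void)handle;
    return -EINVAL;
}

/* Stand-in for some other block whose hw_init succeeds. */
static int mock_common_hw_init(void *handle)
{
    (void)handle;
    return 0;
}

static const struct mock_ip_funcs mock_common_funcs = {
    .hw_init = mock_common_hw_init,
};

static const struct mock_ip_funcs mock_gmc_funcs = {
    .hw_init = mock_gmc_hw_init,
};

static const struct mock_ip_block_version mock_blocks[] = {
    { .type = MOCK_TYPE_COMMON, .funcs = &mock_common_funcs },
    { .type = MOCK_TYPE_GMC,    .funcs = &mock_gmc_funcs },
};

/* Call hw_init on every block of the given type. On failure, store
 * the index of the failing block in *failed and return the error. */
static int mock_init_blocks_of_type(enum mock_block_type type,
                                    size_t *failed)
{
    size_t i;
    int r;

    for (i = 0; i < sizeof(mock_blocks) / sizeof(mock_blocks[0]); i++) {
        if (mock_blocks[i].type != type)
            continue;
        r = mock_blocks[i].funcs->hw_init(NULL);
        if (r) {
            *failed = i;
            return r;
        }
    }

    return 0;
}
```

The generic loop only knows the block type and the function pointer, so
the index and the return value are all it can report. That is why we
have to follow the ".funcs" pointer by hand to find the real function.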

So something is still wrong in that gmc_v8_0_hw_init function that you
patched.

There are several ways it could fail:
> static int gmc_v8_0_hw_init(void *handle)
> {
>     int r;
>     struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> 
>     gmc_v8_0_init_golden_registers(adev);
> 
>     gmc_v8_0_mc_program(adev);  
Up to this point gmc_v8_0_hw_init can't return an error, so we can
ignore that part.

>     if (adev->asic_type == CHIP_TONGA) {
>         r = gmc_v8_0_tonga_mc_load_microcode(adev);
>         if (r) {
>             DRM_ERROR("Failed to load MC firmware!\n");
>             return r;  
We can ignore that too since you have a POLARIS11 ("[drm] initializing
kernel modesetting (POLARIS11 [...]).").

>         }
>     } else if (adev->asic_type == CHIP_POLARIS11 ||
>             adev->asic_type == CHIP_POLARIS10 ||
>             adev->asic_type == CHIP_POLARIS12) {
>         r = gmc_v8_0_polaris_mc_load_microcode(adev);
>         if (r) {
>             DRM_ERROR("Failed to load MC firmware!\n");
>             return r;  
If for some reason it isn't patched, it can fail here. Could you
verify in the sources generated by the linux-libre scripts that the
"return r;" is removed or commented out here?

>         }
>     }
> 
>     r = gmc_v8_0_gart_enable(adev);
>     if (r)
>         return r;  
It could also fail here.

> 
>     return r;
> }  
And if gmc_v8_0_gart_enable succeeded, this "return r;" would not
return -22, so either it fails in gmc_v8_0_gart_enable or the code is
not patched for some reason.

As for gmc_v8_0_gart_enable, given the log it can only fail at one place:
> static int gmc_v8_0_gart_enable(struct amdgpu_device *adev)
> {
>     uint64_t table_addr;
>     int r, i;
>     u32 tmp, field;
> 
>     if (adev->gart.bo == NULL) {
>         dev_err(adev->dev, "No VRAM object for PCIE GART.\n");
>         return -EINVAL;
>     }  
We're not in this case since it would have printed that error otherwise.

>     r = amdgpu_gart_table_vram_pin(adev);
>     if (r)
>         return r;  
Here it can fail.

> [...]
>     DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
>          (unsigned)(adev->gmc.gart_size >> 20),
>          (unsigned long long)table_addr);  

DRM_INFO is just a wrapper over printk, so we would see this message
if it had reached that part, and it would have returned 0 right below:
>     adev->gart.ready = true;
>     return 0;
> }  

In the functions called by amdgpu_gart_table_vram_pin, there are
several code paths that don't print anything when failing, so if it
fails there we need to dig deeper to understand what is going on.

Are you able to add patches in amdgpu_gart_table_vram_pin to also see
which function fails?

For instance you could transform this:
> r = amdgpu_bo_reserve(adev->gart.bo, false);
> if (unlikely(r != 0))
>    return r;  

Into this:
> r = amdgpu_bo_reserve(adev->gart.bo, false);
> if (unlikely(r != 0)) {
>    pr_err("amdgpu_gart_table_vram_pin: amdgpu_bo_reserve => %d\n", r);
>    return r;
> }  

And do something similar for the next functions being called in
amdgpu_gart_table_vram_pin.
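
If there are many calls to instrument, a small macro can avoid
repeating the same lines. The following is a user-space sketch of the
idea: RETURN_ON_ERROR and the *_stub functions are made up, and in the
kernel you would use pr_err instead of fprintf. Which stub fails here
is arbitrary; it says nothing about where the real driver fails.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: evaluate call once and, if it returns
 * non-zero, log the enclosing function, the text of the call and the
 * value, then return that value from the enclosing function. */
#define RETURN_ON_ERROR(call)                                   \
    do {                                                        \
        int err_ = (call);                                      \
        if (err_ != 0) {                                        \
            fprintf(stderr, "%s: %s => %d\n",                   \
                    __func__, #call, err_);                     \
            return err_;                                        \
        }                                                       \
    } while (0)

/* Stubs standing in for the functions called by the real code. */
static int reserve_stub(void)
{
    return 0;
}

static int pin_stub(void)
{
    return -EINVAL;
}

/* Mock of a function that calls several steps in sequence. */
static int vram_pin_mock(void)
{
    RETURN_ON_ERROR(reserve_stub());
    RETURN_ON_ERROR(pin_stub());

    return 0;
}
```

The log line then names both the function we are in and the call that
failed, so one boot is enough to know which branch to follow next.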

And then we'd do the same for the functions that fail, until we
identify exactly where it fails.

We would need to do some tests / code modification to understand if we
can work around this error.

Here MC is probably the memory controller of the GPU, and we skipped
its firmware loading. And the GART is, as far as I understand, a table
that maps system memory so that the GPU can access it. We probably need
that to work somehow to push pixels into the GPU, but we don't need any
of the advanced parts for 3D acceleration.

The BIOS/UEFI can also use the GPU (the BIOS/UEFI driver is in a flash
chip inside the GPU) so there is probably a way, the question here is
if we can make it work with as few modifications as possible.

Could you also provide the output of the following command:
> sudo lspci -s 0000:07:00.0 -vvvv -nn  
The PCI / PCI express devices do make some memory available to the OS
and that's visible with this command.
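
As far as I know the same memory regions can also be read from sysfs,
in /sys/bus/pci/devices/0000:07:00.0/resource, with one "start end
flags" line of hexadecimal numbers per region. Assuming that format,
here is a small sketch that parses one such line and computes the
region size. The struct and function names are made up.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct pci_region {
    uint64_t start;
    uint64_t end;   /* inclusive */
    uint64_t flags;
};

/* Parse one "start end flags" line made of three hex numbers.
 * Returns 0 on success, -1 if the line does not match. */
static int parse_resource_line(const char *line, struct pci_region *out)
{
    if (sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
               &out->start, &out->end, &out->flags) != 3)
        return -1;

    return 0;
}

/* Size in bytes, or 0 for an unused (all-zero) region. */
static uint64_t region_size(const struct pci_region *r)
{
    if (r->start == 0 && r->end == 0)
        return 0;

    return r->end - r->start + 1;
}
```

For example a made-up line like "0x00000000e0000000 0x00000000efffffff
0x000000000014220c" would describe a 256 MiB region, which is the kind
of size we'd expect for the BAR mentioned in the log.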

Denis.