cannot boot with linux-libre>=5.7, amdgpu and cryptsetup

Denis 'GNUtoo' Carikli GNUtoo at cyberdimension.org
Fri Jul 17 19:28:27 UTC 2020


Hi,

Thanks a lot for posting the log.

I'll comment on the amdgpu-raven.log.gz below.

> [   73.434823] BUG: kernel NULL pointer dereference, address:
> 00000000000000b0
[...]
Here you have a kernel crash. It is directly related to the loading of
the amdgpu driver. We can see that in the log:
> [   73.434900]  drm_modeset_register_all+0x13/0x70 [drm]
> [   73.434911]  drm_dev_register+0x160/0x190 [drm]
> [   73.434989]  amdgpu_pci_probe+0x100/0x180 [amdgpu]
Even if it's a kernel bug, we probably want to ignore it for now and see
if it still crashes if/when we manage to fix the driver to work without
firmware.

I've slightly modified the following part of the log to fit the
~70 character limit per line:
> 0000:03:00.0: Missing Free firmware (non-Free firmware loading is
>                                      disabled)
> amdgpu 0000:03:00.0: Failed to load gpu_info
>                      firmware "/*(DEBLOBBED)*/"
> amdgpu 0000:03:00.0: Fatal error during GPU init
Here is the first problem to solve.

Here we have better chances of making it work if we understand what
we are doing because it's the first time we are trying to make the
amdgpu driver work with linux-libre, and it seems to be a bit different
from the radeon driver we are used to, even if many parts seem similar.

I looked into the Linux source code, and that firmware is a file that
contains information on the hardware in binary format. Linux then uses
it to populate data structures that describe the hardware.

In drivers/gpu/drm/amd/amdgpu/amdgpu_device.c we have:
> err = request_firmware(&adev->firmware.gpu_info_fw, fw_name,
>                        adev->dev);
So here it requests the gpu_info firmware, and Linux can then have
access to it.

Later we have:
> err = amdgpu_ucode_validate(adev->firmware.gpu_info_fw);
Here it probably checks that it's a "gpu_info" firmware and not a
random file.

> hdr = (const struct gpu_info_firmware_header_v1_0 *)
>        adev->firmware.gpu_info_fw->data;
> amdgpu_ucode_print_gpu_info_hdr(&hdr->header);
This part is interesting, as it interprets the firmware binary that
is in memory as a "struct gpu_info_firmware_header_v1_0".

The data that follows this header is described by a second struct,
gpu_info_firmware_v1_0. If we look at its definition in
drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.h we have:
> struct gpu_info_firmware_v1_0 {
>         uint32_t gc_num_se;
>         uint32_t gc_num_cu_per_sh;
>         uint32_t gc_num_sh_per_se;
>         uint32_t gc_num_rb_per_se;
>         uint32_t gc_num_tccs;
>         uint32_t gc_num_gprs;
>         uint32_t gc_num_max_gs_thds;
>         uint32_t gc_gs_table_depth;
>         uint32_t gc_gsprim_buff_depth;
>         uint32_t gc_parameter_cache_depth;
>         uint32_t gc_double_offchip_lds_buffer;
>         uint32_t gc_wave_size;
>         uint32_t gc_max_waves_per_simd;
>         uint32_t gc_max_scratch_slots_per_cu;
>         uint32_t gc_lds_size;
> };

So now we need to understand what these fields are.

I've tried with the first field. amdgpu/amdgpu_device.c has:
> adev->gfx.config.max_shader_engines =
>                                le32_to_cpu(gpu_info_fw->gc_num_se);
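
To make the mechanism concrete, here is a minimal user-space sketch of
the same idea: raw little-endian bytes are decoded into a field, which
is then copied into a configuration structure. All the demo_* names
are hypothetical, and the payload struct is cut down to its first
field, so this is an illustration and not the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for the gpu_info payload: only the
 * first field of the struct quoted above is kept. */
struct demo_gpu_info {
        uint32_t gc_num_se;
};

/* Hypothetical stand-in for the driver-side configuration. */
struct demo_gfx_config {
        uint32_t max_shader_engines;
};

/* Portable equivalent of what le32_to_cpu() achieves: build a host
 * integer from four little-endian bytes. */
static uint32_t demo_le32_to_cpu(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fill the configuration from a raw blob. Returns 0 on success and
 * -1 if the blob is too short to hold the payload. */
static int demo_parse_gpu_info(const uint8_t *blob, size_t len,
                               struct demo_gfx_config *cfg)
{
        assert(blob != NULL && cfg != NULL);
        if (len < sizeof(struct demo_gpu_info))
                return -1;
        cfg->max_shader_engines = demo_le32_to_cpu(blob);
        return 0;
}
```

With a blob starting with the bytes 04 00 00 00, the sketch stores 4
in max_shader_engines. The length check is there because a file that
is too short must not be read past its end.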

So even if this firmware looks trivial to replace, all the fields seem
related to the 3D acceleration, so at first we can just try patching
the code not to report an error if the loading fails.

So if we go back to the error we had:
> 0000:03:00.0: Missing Free firmware (non-Free firmware loading is
>                                      disabled)
> amdgpu 0000:03:00.0: Failed to load gpu_info
>                      firmware "/*(DEBLOBBED)*/"
> amdgpu 0000:03:00.0: Fatal error during GPU init
Here it prints "Failed to load gpu_info firmware".

That print is made by that code:
> snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_gpu_info.bin",
>          chip_name);
> err = request_firmware(&adev->firmware.gpu_info_fw, fw_name,
>                        adev->dev);
> if (err) {
>     dev_err(adev->dev,
>             "Failed to load gpu_info firmware \"%s\"\n",
>             fw_name);
>     goto out;
> }

So you can try to patch it by replacing the "goto out;" with the usual
"/*(DEBLOBBED)*/", try out the result, and post the corresponding log
to this mailing list. Having only this modification and the
corresponding log would enable us to progress further.
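
As a rough illustration of the control flow after such a change, here
is a user-space mock. The demo_* functions are hypothetical stand-ins
(the real code lives in the kernel and cannot run this way), and the
fw_loaded flag is an illustrative addition that is not in the driver.
It is only there to make visible that execution continues without
usable firmware data.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for request_firmware(): it always fails,
 * like a firmware request that linux-libre refuses. */
static int demo_request_firmware(const char *fw_name)
{
        (void)fw_name;
        return -ENOENT;
}

/* Sketch of the patched flow. Returns 0 when initialisation is
 * allowed to continue; *fw_loaded tells whether firmware data is
 * actually available to the code that runs afterwards. */
static int demo_parse_gpu_info_fw(const char *chip_name, int *fw_loaded)
{
        char fw_name[40];
        int err;

        assert(chip_name != NULL && fw_loaded != NULL);
        snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_gpu_info.bin",
                 chip_name);
        err = demo_request_firmware(fw_name);
        if (err) {
                fprintf(stderr,
                        "Failed to load gpu_info firmware \"%s\"\n",
                        fw_name);
                /*(DEBLOBBED)*/ /* was: goto out; */
        }
        *fw_loaded = (err == 0);
        return 0;
}
```

Calling demo_parse_gpu_info_fw("raven", &loaded) still prints the
error message but returns 0 with loaded set to 0. Any later code that
reads the firmware data would have to cope with it being absent, which
is the kind of follow-up failure the next log should reveal.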

Note that dev_err is just a function that is used to print error
messages, so we can ignore it.

There are also many similar functions, like dev_info and many others.

They are called dev_something because the "adev->dev" is a data
structure about the device which has information like the driver name
(amdgpu), the PCI device (0000:03:00.0) and other things.
So at the end it can print:
> amdgpu 0000:03:00.0: Failed to load gpu_info firmware
instead of just:
> Failed to load gpu_info firmware
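
The prefixing can be mimicked in user space with a small sketch.
struct demo_device and demo_dev_format() are hypothetical and much
simpler than the kernel's struct device and dev_*() helpers; they only
show where the "amdgpu 0000:03:00.0:" part of the message comes from.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, much reduced stand-in for the kernel's struct
 * device. */
struct demo_device {
        const char *driver_name;   /* e.g. "amdgpu" */
        const char *dev_name;      /* e.g. "0000:03:00.0" */
};

/* Format "<driver> <device>: <message>" into buf, the way the
 * dev_*() helpers prefix their output. Returns what snprintf()
 * returns, i.e. the length of the full message. */
static int demo_dev_format(char *buf, size_t size,
                           const struct demo_device *dev,
                           const char *msg)
{
        assert(buf != NULL && dev != NULL && msg != NULL);
        return snprintf(buf, size, "%s %s: %s",
                        dev->driver_name, dev->dev_name, msg);
}
```

Given a device initialised as { "amdgpu", "0000:03:00.0" } and the
message "Failed to load gpu_info firmware", the buffer ends up holding
the prefixed line quoted above.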

Replacing the "goto out;" will probably still result in a black
screen because it will probably fail again at the next firmware loading
code, however with the log, we will be able to continue finding and
fixing the next firmware loading code.

If the failure looks similar to what we know from the radeon driver,
you may also want to try advancing on it yourself and then send the
resulting patches and corresponding logs to this mailing list.

If not, I or other interested people can also help analyze the logs
from the "goto out;" replacement, and we can try to see together what
part to patch next.

Denis.