cannot boot with linux-libre>=5.7, amdgpu and cryptsetup

Sat Jul 25 02:20:02 UTC 2020

On Fri, 24 Jul 2020 04:32:38 +0200 (CEST)
Edgar Lux <edgarlux at mailfence.com> wrote:

> If you do a grep on the linux-libre source, that line does not exist:
[...]
> I followed the instructions from
> #+begin_EXAMPLE
>   I've also a more in depth howto here:
>   https://libreplanet.org/wiki/Group:Hardware/research/gpu/radeon/How_to_patch_and_test_linux-libre_in_Parabola
> #+end_EXAMPLE
[...]
> Basically,
> 1. I cloned linux-libre
[...]
> 8. Modified my boot loader and restarted
Thanks, 

I'll try to update the instruction after we manage to make that driver
work (or fail at trying).

> Still no joy. I am attaching the log :) . Let me know what is next.
Thanks.

I'll comment on it below. I've removed the timestamps to make it easier
to read.

> [drm] amdgpu kernel modesetting enabled.
[...]
> fb0: switching to amdgpudrmfb from EFI VGA
> Console: switching to colour dummy device 80x25
> amdgpu 0000:03:00.0: vgaarb: deactivate vga console
Here it get further in the probe: it now tried to initialize the GPU
for real instead of bailing out due to some missing firmware.

And then it crash due to a bug that is different from the previous crash
we had:
> BUG: kernel NULL pointer dereference, address: 00000000000000a0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0 
> Oops: 0000 [#1] PREEMPT SMP NOPTI
[...]
>  amdgpu_driver_load_kms+0x8c/0x1d0 [amdgpu]
So here we know that it crashed in the amdgpu_driver_load_kms function
due to some data structures that are NULL.

This is a well known issue in C.

Let's say you have a definition of a data structure like that:
> struct dev {
>    int bus_id;
> };
That only defines how data should look like in memory.

So if you want to use it you could do that:
> struct dev my_device; // This reserves a memory region for a "struct dev"
> my_device.bus_id = 1; // this sets the bus_id field to 1
> printf("%d\n", my_device.bus_id); // this prints the bus id
Here it will "1".

As people want to use functions, they typically pass the memory address
between different functions instead of copying the data.

So if you write a function that gets the device ID:
> int get_bus_id_1(struct device *dev)
> {
>     return dev->bus_id;
> }

You typically pass it the memory address of the struct (here my_device):
> int bus_id = get_bus_id(&my_device);

But if instead you pass it a bogus address:
> struct dev *my_device = 0; // points to address 0
> int bus_id = get_bus_id(&my_device);

It can crash when running the "dev->bus_id".

So we have a crash like that in amdgpu_driver_load_kms.

Here's amdgpu_driver_load_kms before amdgpu_device_init:
> int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long
> flags)
> {
> 	struct amdgpu_device *adev;
> 	int r, acpi_status;
> 
> 	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
> 	if (adev == NULL) {
> 		return -ENOMEM;
> 	}
> 	dev->dev_private = (void *)adev;
dev->dev_private could crash but it's very unlikely.

> 	if (amdgpu_has_atpx() &&
> 	    (amdgpu_is_atpx_hybrid() ||
> 	     amdgpu_has_atpx_dgpu_power_cntl()) &&
> 	    ((flags & AMD_IS_APU) == 0) &&
> 	    !pci_is_thunderbolt_attached(dev->pdev))
> 		flags |= AMD_IS_PX;
dev->pdev could crash but it's very unlikely.

Then we have that, and it's commented in your patch:
> r = amdgpu_device_init(adev, dev, dev->pdev, flags);
> if (r) {
> 	dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>       goto out;
> }
We need to uncomment it to go further.

> if (amdgpu_device_supports_boco(dev) &&
>     (amdgpu_runtime_pm != 0)) /* enable runpm by default for boco */
> 	adev->runpm = true;
> else if (amdgpu_device_supports_baco(dev) &&
> 	 (amdgpu_runtime_pm != 0) &&
> 	 (adev->asic_type >= CHIP_TOPAZ) &&
> 	 (adev->asic_type != CHIP_VEGA10) &&
> 	 (adev->asic_type != CHIP_VEGA20) &&
> 	 (adev->asic_type != CHIP_ARCTURUS)) /* enable runpm on VI+ */
> 	adev->runpm = true;
> else if (amdgpu_device_supports_baco(dev) &&
> 	 (amdgpu_runtime_pm > 0))  /* enable runpm if runpm=1 on CI */
> 	adev->runpm = true;
That part is unlikely to fail as it only does simple comparisons and
assignments.

> /* Call ACPI methods: require modeset init
>  * but failure is not fatal
>  */
> if (!r) {
> 	acpi_status = amdgpu_acpi_init(adev);
This is probably where it fails. That function access things like
"adev->ddev->mode_config.encoder_list". However since the
amdgpu_device_init is commented, some data structures are probably not
set correctly, which makes it crash.

For instance amdgpu_device_init does "adev->ddev = ddev;".

While writing this mail I saw that commenting reject_firmware wasn't
enough as then the firmware data is used to populate some data
structures and not doing that is supposed to crash the driver if I
understood well.

So instead I'll just comment that firmware stuff and populate the data
structures with dummy values for now.

Denis.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://www.fsfla.org/pipermail/linux-libre/attachments/20200725/60979d2c/attachment.sig>