PCI Memory-Mapped Allocation Algorithm Bug
I've been trying to solve a frustrating issue where the graphics adapter fails to have its memory mapped into the PCI address space when the laptop has 3GB of RAM rather than 2GB.
I tracked it as kernel bug 10461. The explanation I've been unravelling is rather fascinating, so I thought I'd record it here.
It affects 32-bit and 64-bit Linux.
GFX fails: PCI 64-bit BAR mapped above 4GB on 32-bit northbridge
Latest working kernel version: None
Earliest failing kernel version: 2.6.22 (earlier kernels not checked)
Distribution: kernel.org, Ubuntu
Hardware Environment: Sony Vaio VGN-FE41Z (nVidia GeForce 7600, Intel 945PM northbridge)
Software Environment: kernel
The laptop ships with 2GB RAM. I recently added 1GB to make a total of 3GB. This is the maximum that can be used due to the 32-bit i945PM chipset (top 1GB reserved for PCI memory-mapped I/O, etc.).
With Linux (x86_64) the integrated nVidia GeForce Go 7600 fails (with both the nv and nvidia drivers) because its 64-bit BAR has been placed above the 4GB boundary, which, if I understand correctly, is not reachable since the i945PM chipset can only address up to 4GB (32-bit address bus).
Somehow Windows Vista (32-bit) works correctly in all respects with 3GB RAM and its nVidia drivers so there must be a way to work around this.
```
[ 0.000000] Linux version 2.6.25-rc9 (tj@hephaestion) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #3 SMP PREEMPT Thu Apr 17 02:06:09 BST 2008
[ 0.316940] PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
[ 0.317038] PCI: Cannot allocate resource region 1 of device 0000:01:00.0
[ 22.016654] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 22.016655] NVRM: BAR1 is 256M @ 0x00000000 (PCI:0001:00.0)
[ 22.016658] NVRM: This is a 64-bit BAR mapped above 4GB by the system BIOS or
[ 22.016659] NVRM: Linux kernel. The NVIDIA Linux graphics driver and other
[ 22.016660] NVRM: system software do not currently support this configuration
[ 22.016662] NVRM: reliably.
[ 22.016668] nvidia: probe of 0000:01:00.0 failed with error -1
[ 22.016687] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 22.016690] NVRM: None of the NVIDIA graphics adapters were initialized!
```
```
01:00.0 VGA compatible controller: nVidia Corporation G70 [GeForce Go 7600] (rev a1) (prog-if 00 [VGA])
        Subsystem: Sony Corporation Unknown device 81ef
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 100000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, non-prefetchable) [size=16M]
        Region 5: I/O ports at 2000 [size=128]
```
Notice Region 1 is placed at 0x100000000, i.e. 4GB-4.25GB, beyond the chipset's reach.
```
01:00.0 VGA compatible controller: nVidia Corporation G70 [GeForce Go 7600] (rev a1) (prog-if 00 [VGA])
        Subsystem: Sony Corporation Unknown device 81ef
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, non-prefetchable) [size=16M]
        Region 5: I/O ports at 2000 [size=128]
```
Notice Region 1 is now placed at 0xb0000000, i.e. 2.75GB-3GB. This is the working configuration with 2GB of RAM installed.
Windows Vista 32-bit doesn't have the problem. Here is Vista's msinfo32 memory map snippet showing the Nvidia PCI BAR below 4GB (actually at 3GB):
```
0xC0000000-0xFEBFFFFF  PCI bus  OK
0xC0000000-0xFEBFFFFF  Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1  OK
0xC0000000-0xFEBFFFFF  NVIDIA GeForce Go 7600  OK
0xFED40000-0xFED44FFF  PCI bus  OK
0xD0000000-0xD1FFFFFF  Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1  OK
0xD0000000-0xD1FFFFFF  NVIDIA GeForce Go 7600  OK
0xD1000000-0xD1FFFFFF  NVIDIA GeForce Go 7600  OK
```
Investigating how Windows Vista might differ from Linux in PCI device configuration, I discovered a Microsoft presentation, "PCI Express Update for Windows Longhorn", whose "Memory Resource Assignment" slide (slide 20) says:
- Windows Server 2003
- A PCI device with a BIOS configuration above 4GB is always assigned resources from below the 4GB region
- If no range below 4GB region is available, then the device is assigned a range above the 4GB boundary
- This holds good even if the Windows OS cannot physically access the address range above 4GB
- Longhorn (Vista)
- A memory address range above 4GB is available for PCI devices only if that range is physically accessible by the OS
- Within this constraint, Windows will always attempt to respect the BIOS configuration on a PCI device.
- A PCI device with 64-bit BARs and no BIOS configuration is still assigned a memory address range above 4GB as available on the parent
The key statement there I noticed being "A memory address range above 4GB is available for PCI devices only if that range is physically accessible by the OS"
Then on slide 22, "BIOS Design Recommendations":
- The ACPI BIOS should describe the memory range above 4GB in the _CRS and/or _PRS of the PCI root bus
- This is described using the QWord Address Space descriptor as defined in the ACPI spec
- Windows will evaluate the _SRS method with a buffer in the same format as the _CRS/_PRS
- The memory range for the PCI root bus should not overlap with the physical RAM or some other range
- The memory range is required to be physically accessible by the processor/chipset
- The ACPI BIOS should configure the resources on the PCI devices after evaluating the _OSI method to account for the Server 2003 behavior
- The ACPI BIOS should return an appropriate buffer in the evaluation of the _CRS/_PRS on a PCI root bus to account for the Server 2003 behavior
What caught my eye there is "The memory range is required to be physically accessible by the processor/chipset" because it seems that is the issue in this bug.
I've compared the PCI allocations under Linux with 2GB, Linux with 3GB, and Windows with 3GB.
The most obvious thing is that Linux allocates small PCI root port ranges in prime locations.
As I understand it, the range allocated to a device must lie on a boundary of its own size: if the range is 256MB it must be allocated on a 256MB boundary. Windows uses the first 256MB of the top 1GB PCI range, at 0xC0000000. Linux puts several small ranges in that space, and the UHCI and other devices start at 0xD2304000, so the second 256MB window is not available either.
The allocation of small ranges in the 0xC0000000-0xCFFFFFFF space forces the 256MB of the GFX card to be placed at 0xB0000000-0xBFFFFFFF when there is 2GB of RAM installed (RAM ends at 0x7FFFFFFF).
With 3GB of RAM installed, RAM ends at 0xBFFFFFFF. That forces the GFX to be allocated a different range, and since the 0xC0000000+ range holds those small allocations, the only place left is above the 4GB boundary.
Examining the Windows allocation addresses against the PCI specifications, it is clear that Windows allocates ranges using the recommended subtractive decode (in other words, it starts at the top of the address space and allocates working down), whereas Linux appears to use a bottom-up strategy for everything except one device, the "Mobile PCI Bridge [8086:2448]".
Comparison of Linux vs Windows PCI memory mapped allocations
| Linux 2GB Address | Linux 3GB Address | Windows 3GB Address | Device |
| --- | --- | --- | --- |
| | | 0xA0000-0xBFFFF | PCI bus |
| | | 0xA0000-0xBFFFF | Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1 |
| | | 0xA0000-0xBFFFF | NVIDIA GeForce Go 7600 |
| | | 0xD0000-0xD3FFF | PCI bus |
| | | 0xD4000-0xD7FFF | PCI bus |
| | | 0xD8000-0xDBFFF | PCI bus |
| | | 0xC0000000-0xFEBFFFFF | PCI bus |
| 0xb0000000-0xbfffffff | 0x100000000-0x10fffffff | 0xC0000000-0xFEBFFFFF | Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1 |
| 0xb0000000 | 0x100000000 | 0xC0000000-0xFEBFFFFF | NVIDIA GeForce Go 7600 |
| | | 0xFED40000-0xFED44FFF | PCI bus |
| 0xd0000000-0xd1ffffff | 0xd0000000-0xd1ffffff | 0xD0000000-0xD1FFFFFF | Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1 |
| 0xd0000000 | 0xd0000000 | 0xD0000000-0xD1FFFFFF | NVIDIA GeForce Go 7600 |
| 0xd1000000 | 0xd1000000 | 0xD1000000-0xD1FFFFFF | NVIDIA GeForce Go 7600 |
| 0xd2300000 | 0xd2300000 | 0xD2300000-0xD2303FFF | High Definition Audio Controller |
| 0xc8000000-0xc9ffffff | 0xc8000000-0xc9ffffff | 0xFCC00000-0xFEBFFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D0 |
| 0xc0000000-0xc1ffffff | 0xc0000000-0xc1ffffff | 0xF4800000-0xF67FFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D0 |
| 0xca000000-0xcbffffff | 0xca000000-0xcbffffff | 0xFAC00000-0xFCBFFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D2 |
| 0xc2000000-0xc3ffffff | 0xc2000000-0xc3ffffff | 0xF2800000-0xF47FFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D2 |
| 0xcc000000-0xcdffffff | 0xcc000000-0xcdffffff | 0xF8C00000-0xFABFFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D4 |
| 0xc4000000-0xc5ffffff | 0xc4000000-0xc5ffffff | 0xF0800000-0xF27FFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D4 |
| 0xcc000000 | 0xcc000000 | 0xFABFF000-0xFABFFFFF | Intel(R) PRO/Wireless 3945ABG Network Connection |
| 0xce000000-0xcfffffff | 0xce000000-0xcfffffff | 0xF6C00000-0xF8BFFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D6 |
| 0xc6000000-0xc7ffffff | 0xc6000000-0xc7ffffff | 0xDE000000-0xDFFFFFFF | Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D6 |
| 0xd2304000 | 0xd2304000 | 0xD2304000-0xD23043FF | Intel(R) 82801G (ICH7 Family) USB2 Enhanced Host Controller - 27CC |
| 0xd2000000-0xd20fffff | 0xd2000000-0xd20fffff | 0xF6800000-0xF6BFFFFF | Intel(R) 82801 PCI Bridge - 2448 |
| 0x8c000000-0x8ffff000 | 0xd4000000-0xd7fff000 | 0xF6BFD000-0xF6BFDFFF | Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller |
| | | 0xF6BFE000-0xF6BFEFFF | Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller |
| 0x88000000-0x8bfff000 | 0xd8000000-0xdbfff000 | 0xF6BFF000-0xF6BFFFFF | Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller |
| 0xd2007000 | 0xd2007000 | 0xFED44000-0xFED44FFF | Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller |
| 0xd2006000 | 0xd2006000 | 0xF6BF6800-0xF6BF6FFF | Texas Instruments OHCI Compliant IEEE 1394 Host Controller |
| 0xd2000000 | 0xd2000000 | 0xF6BF8000-0xF6BFBFFF | Texas Instruments OHCI Compliant IEEE 1394 Host Controller |
| 0xd2004000 | 0xd2004000 | 0xF6BF7000-0xF6BF7FFF | Texas Instruments PCIxx12 Integrated FlashMedia Controller |
| 0xd2005000 | 0xd2005000 | 0xF6BFC000-0xF6BFCFFF | Intel(R) PRO/100 VE Network Connection |
| | | 0xFF000000-0xFFFFFFFF | Intel(R) 82802 Firmware Hub Device |
| | | 0xFED00000-0xFED003FF | High Precision Event Timer |
| 0xd2304400 | 0xd2304400 | 0xD2304400-0xD23047FF | Intel(R) 82801GBM/GHM (ICH7-M Family) Serial ATA Storage Controller - 27C4 |
| | | 0xE0000000-0xEFFFFFFF | Motherboard resources |
| | | 0xFED14000-0xFED17FFF | Motherboard resources |
| | | 0xFED18000-0xFED18FFF | Motherboard resources |
| | | 0xFED19000-0xFED19FFF | Motherboard resources |
| | | 0xFED1C000-0xFED1FFFF | Motherboard resources |
| | | 0xFED20000-0xFED3FFFF | Motherboard resources |
| | | 0xFED45000-0xFED8FFFF | Motherboard resources |
Cause and Potential Solution
After a few debug printk() runs watching the allocation strategy I wondered why the PCI resources region doesn't start at the beginning of the largest gap:
[ 0.000000] Allocating PCI resources starting at c2000000 (gap:c0000000:20000000)
since, when 3GB RAM is installed, the gap starts at 0xC0000000 but the allocation region begins at 0xC2000000.
The other issue is that there are several gaps in the top 1GB range, but only the largest gap is used for PCI IOMEM allocations. That explains why smaller allocations for other devices effectively choke off use of the range in the 32-bit address space.
In contrast, judging by the addresses in the allocation comparison, Windows uses *all* gaps for allocation rather than just the largest. It noticeably places smaller regions in the gaps between the various 'high' e820 reservations.
Looking for the origins of the gap-rounding code, I eventually found commit f0eca9626c6becb6fc56106b2e4287c6c784af3d from 2005-09-09:
```
[PATCH] Update PCI IOMEM allocation start

This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
apply power", which was due to the CardBus IOMEM register region being
allocated at an address that was actually inside the RAM window that
had been reserved for video frame-buffers in an UMA setup.
```
This introduces a simple 'rounding up' algorithm to create a 'gap' between top of system RAM and beginning of PCI IOMEM as a guard against unintentional over-writes.
The algorithm used was suggested in an example by Linus Torvalds, with some provisos, but was adopted verbatim in the patch for the Averatec bug. In his email, Linus went on to say:
"The other alternative is to make PCI allocations generally start at the high range of the allowable - judging by the lspci listings I've seen from people under Windows, that seems to be what Windows does, which might be a good idea (ie the closer we match windows allocation patterns, the more likely we're to not hit some unmarked region - because windows testing would have hit it too)."
That comment matches my findings in dealing with this bug. Looking at the bug, there are four issues:
1. No 256MB region on a 256MB boundary is available for the GFX IOMEM in the single largest PCI IOMEM region.
2. The first available 256MB region on a 256MB boundary is unusable because pci_mem_start is 'rounded up' to gap_start + round.
3. Multiple gaps higher in the address space are left unused, whereas Windows uses them for smaller allocations, keeping the largest gap free for the devices with large requirements.
4. Resources aren't allocated top-down (subtractive decode) as recommended in the PCI specs and Intel chipset datasheets, and as done by Windows.
See the Mobile Intel 945 Express Chipset Family Datasheet, section 9.2 "Main Memory Address Range", figure 13 "PCI Memory Address Range".
If top-down allocation (issue 4) were implemented in addition to using all gaps (issue 3), the smaller allocations would sit at the top of the 32-bit address space, much as on Windows.
Implementing fixes for issues 3 and 4 together should remove the need for commit f0eca962 (CardBus IOMEM in shared video RAM space), since the CardBus IOMEM would then land in a 'high' gap (as it does with Windows).
Dropping commit f0eca962 would then solve issues 1 and 2, since the GFX could allocate 256MB on the 256MB boundary at 0xC0000000 in the largest gap.
There might be an issue if a system had an undeclared shared video memory region plus another PCI device needing a large allocation, but if there were, surely Windows would be affected by it too.
Also, Linus's suggestion of maintaining an unused gap between top-of-RAM and bottom-of-PCI-IOMEM needs to be considered. Would implementing fixes for issues 3 and 4 remove the need for it? Windows doesn't maintain a similar gap; is there a reason that Linux should?
Proposal for Rewriting Linux PCI allocation
I'm exploring what is required to replace the existing allocation algorithms with something more sophisticated, in line with the chipset recommendations and the practice established by Windows. I've moved the documentation for that proposal to PCI Dynamic Resource Allocation Management.
These are references to the source code I am reviewing while dealing with this bug.
Code-path during resource allocation where failure is reported:
```
arch/x86/pci/i386.c::pcibios_resource_survey()
arch/x86/pci/i386.c::pcibios_allocate_bus_resources()
drivers/pci/pci.c::pci_find_parent_resource()
kernel/resource.c::request_resource()
    // failure reports: [ 0.346211] PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
arch/x86/pci/i386.c::pcibios_allocate_bus_resources()  // re-entrant
arch/x86/pci/i386.c::pcibios_allocate_resources(0)
arch/x86/pci/i386.c::pcibios_allocate_resources(1)
include/linux/pci.h::pci_read_config_word()
drivers/pci/pci.c::pci_find_parent_resource()
kernel/resource.c::request_resource()
    // failure reports: [ 0.350302] PCI: Cannot allocate resource region 1 of device 0000:01:00.0
```
The code-path taken during resource allocation to devices (not the earlier path that fails to allocate to the 64-bit BAR):
```
drivers/pci/setup-bus.c::pbus_assign_resources_sorted()
drivers/pci/setup-res.c::pdev_sort_resources()
drivers/pci/setup-res.c::pci_assign_resource()
drivers/pci/bus.c::pci_bus_alloc_resource()
kernel/resource.c::allocate_resource()
kernel/resource.c::find_resource()
kernel/resource.c::__request_resource()
```