PCI Memory-Mapped Allocation Algorithm Bug

See also:

PCI Dynamic Resource Allocation Management
PCI Resource Allocation

I've been trying to solve a frustrating issue where the graphics adapter fails to have its memory mapped into the PCI address space when the laptop has 3GB of RAM rather than 2GB.

I tracked it in kernel bug 10461. The explanation I seem to be unravelling is rather fascinating, so I thought I'd record it here.

It affects 32-bit and 64-bit Linux.

GFX fails: PCI 64-bit BAR mapped above 4GB on 32-bit northbridge

Latest working kernel version: None
Earliest failing kernel version: 2.6.22 (earlier kernels not checked)
Distribution: kernel.org, Ubuntu
Hardware Environment: Sony Vaio VGN-FE41Z (nVidia GeForce 7600, Intel 945PM northbridge)
Software Environment: kernel

The laptop ships with 2GB RAM. I recently added 1GB to make a total of 3GB. This is the maximum that can be used due to the 32-bit i945PM chipset (top 1GB reserved for PCI memory-mapped I/O, etc.).

With Linux (x86_64) the integrated nVidia GeForce Go 7600 fails (with either the nv or nvidia driver) because its 64-bit BAR has been placed above the 4GB boundary which, if I understand correctly, is not reachable since the i945PM chipset can only address up to 4GB (32-bit address bus).

Somehow Windows Vista (32-bit) works correctly in all respects with 3GB RAM and its nVidia drivers so there must be a way to work around this.

3GB RAM

[    0.000000] Linux version 2.6.25-rc9 (tj@hephaestion) (gcc version 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #3 SMP PREEMPT Thu Apr 17 02:06:09 BST 2008

[    0.316940] PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
[    0.317038] PCI: Cannot allocate resource region 1 of device 0000:01:00.0

[   22.016654] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[   22.016655] NVRM: BAR1 is 256M @ 0x00000000 (PCI:0001:00.0)
[   22.016658] NVRM: This is a 64-bit BAR mapped above 4GB by the system BIOS or
[   22.016659] NVRM: Linux kernel. The NVIDIA Linux graphics driver and other
[   22.016660] NVRM: system software do not currently support this configuration
[   22.016662] NVRM: reliably.
[   22.016668] nvidia: probe of 0000:01:00.0 failed with error -1
[   22.016687] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   22.016690] NVRM: None of the NVIDIA graphics adapters were initialized!

lspci reports:

01:00.0 VGA compatible controller: nVidia Corporation G70 [GeForce Go 7600] (rev a1) (prog-if 00 [VGA])
        Subsystem: Sony Corporation Unknown device 81ef
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 100000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, non-prefetchable) [size=16M]
        Region 5: I/O ports at 2000 [size=128]

Notice that Region 1 is placed at 4GB - 4.25GB (0x100000000 plus 256MB), which is above the 4GB boundary.

2GB RAM

01:00.0 VGA compatible controller: nVidia Corporation G70 [GeForce Go 7600] (rev a1) (prog-if 00 [VGA])
        Subsystem: Sony Corporation Unknown device 81ef
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, non-prefetchable) [size=16M]
        Region 5: I/O ports at 2000 [size=128]

Notice Region 1 is placed at 2.75GB - 3GB
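
The address arithmetic behind the two "Notice" lines can be checked mechanically. The following is a minimal Python sketch, not part of the original investigation: the function name parse_memory_regions and the regular expression are illustrative assumptions, and they only cover the "Region N: Memory at ..." lines in the form lspci prints them above.

```python
import re

GIB = 1 << 30
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Matches lines such as:
#   Region 1: Memory at 100000000 (64-bit, prefetchable) [size=256M]
# I/O port regions are deliberately not matched.
REGION_RE = re.compile(
    r"Region (\d+): Memory at ([0-9a-f]+) \((\d+)-bit, ([a-z-]+)\) "
    r"\[size=(\d+)([KMG])\]"
)


def parse_memory_regions(lspci_text):
    """Return a list of (region, start, end_exclusive, above_4g) tuples."""
    regions = []
    for match in REGION_RE.finditer(lspci_text):
        index, base, _width, _kind, size, unit = match.groups()
        start = int(base, 16)
        end = start + int(size) * UNITS[unit]
        # A region is unreachable for a 32-bit chipset if any part of it
        # lies beyond the 4GB boundary.
        regions.append((int(index), start, end, end > 4 * GIB))
    return regions


# Region lines copied from the 3GB lspci listing above.
LSPCI_3GB = """
        Region 0: Memory at d1000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 100000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, non-prefetchable) [size=16M]
        Region 5: I/O ports at 2000 [size=128]
"""

for index, start, end, above in parse_memory_regions(LSPCI_3GB):
    flag = "ABOVE 4GB" if above else "ok"
    print(f"Region {index}: {start / GIB:.2f}GB - {end / GIB:.2f}GB  {flag}")
```

For Region 1 of the 3GB listing this reports 4.00GB - 4.25GB and flags it as above 4GB; the other two memory regions stay below the boundary.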

Windows Vista

Windows Vista 32-bit doesn't have the problem. Here is Vista's msinfo32 memory map snippet showing the Nvidia PCI BAR below 4GB (actually at 3GB):

0xC0000000-0xFEBFFFFF   PCI bus OK      
0xC0000000-0xFEBFFFFF   Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1   OK      
0xC0000000-0xFEBFFFFF   NVIDIA GeForce Go 7600  OK      
0xFED40000-0xFED44FFF   PCI bus OK      
0xD0000000-0xD1FFFFFF   Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 27A1   OK      
0xD0000000-0xD1FFFFFF   NVIDIA GeForce Go 7600  OK      
0xD1000000-0xD1FFFFFF   NVIDIA GeForce Go 7600  OK      

Investigating how Windows Vista might differ from Linux in PCI device configuration, I discovered a Microsoft presentation, "PCI Express Update for Windows Longhorn", whose "Memory Resource Assignment" slide (20) says:

  • Windows Server 2003
    • A PCI device with BIOS configuration above 4GB, is always assigned resources from below the 4GB region
    • If no range below 4GB region is available, then the device is assigned a range above the 4GB boundary
      • This holds good even if the Windows OS cannot physically access the address range above 4GB
  • Longhorn Vista
    • A memory address range above 4GB is available for PCI devices only if that range is physically accessible by the OS
    • Within this constraint, Windows will always attempt to respect the BIOS configuration on a PCI device.
    • A PCI device with 64-bit BARs and no BIOS configuration is still assigned a memory address range above 4GB as available on the parent

The key statement there, as I read it, is "A memory address range above 4GB is available for PCI devices only if that range is physically accessible by the OS".

Then on slide 22, "BIOS Design Recommendations":

  • The ACPI BIOS should describe the memory range above 4GB in the _CRS and/or _PRS of the PCI root bus
    • This is described using the QWord Address Space descriptor as defined in the ACPI spec (Section 6.4.3.5.1).
    • Windows will evaluate the _SRS method with a buffer in the same format as the _CRS/_PRS
  • The memory range for the PCI root bus should not overlap with the physical RAM or some other range
  • The memory range is required to be physically accessible by the processor/chipset
  • The ACPI BIOS should configure the resources on the PCI devices after evaluating the _OSI method to account for the Server 2003 behavior
  • The ACPI BIOS should return an appropriate buffer in the evaluation of the _CRS/_PRS on a PCI root bus to account for the Server 2003 behavior

What caught my eye there is "The memory range is required to be physically accessible by the processor/chipset" because it seems that is the issue in this bug.

Comparison

I've done an analysis of the PCI allocations between Linux with 2GB, 3GB, and Windows 3GB.

The most obvious thing is that Linux allocates small PCI root port ranges in prime locations.

As I understand it, the range allocated to a device must be naturally aligned, that is, placed on a boundary equal to its own size: a 256MB range must start on a 256MB boundary. Windows uses the first 256MB of the top 1GB PCI range, at 0xC0000000. Linux puts several small ranges in that space. The EHCI (USB2) controller and other small devices sit at 0xD2304000 and nearby, so the second 256MB window (0xD0000000) is not available either.
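
The same-size boundary rule can be written down as a pair of bit operations. This is a minimal sketch assuming power-of-two sizes (which is how PCI BAR sizes are defined); the helper names is_naturally_aligned and align_up are made up for illustration and are not kernel functions.

```python
MIB = 1 << 20


def is_naturally_aligned(address, size):
    """True if `address` is a multiple of `size` (a power of two)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return address & (size - 1) == 0


def align_up(address, size):
    """Round `address` up to the next multiple of `size` (a power of two)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return (address + size - 1) & ~(size - 1)


# 0xC0000000 and 0xB0000000 are valid bases for a 256MB range;
# 0xC2000000 is not, and the next valid base after it is 0xD0000000.
print(is_naturally_aligned(0xC0000000, 256 * MIB))
print(is_naturally_aligned(0xC2000000, 256 * MIB))
print(hex(align_up(0xC2000000, 256 * MIB)))
```

The point of the example is that starting the search even slightly above 0xC0000000 pushes a 256MB request a full 256MB further up.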

The allocation of small ranges in the 0xC0000000-0xCFFFFFFF space forces the 256MB of the GFX card to be placed at 0xB0000000-0xBFFFFFFF when there is 2GB of RAM installed (RAM ends at 0x7FFFFFFF).

With 3GB of RAM installed the end address of RAM is 0xBFFFFFFF. That forces the GFX to be allocated a different range and as the 0xC0000000+ range has those small allocations the only place left is above the 4GB boundary.
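
To make the "only place left" reasoning concrete, here is a small sketch that searches for the lowest naturally aligned 256MB window between top-of-RAM and 4GB. It is a simplified model, not the kernel's algorithm: the function name find_aligned_window is hypothetical, and the busy list is a coarse summary of ranges taken from the comparison table on this page (the eight 32MB root-port windows merged into one range, the two 16MB GFX regions merged, plus the motherboard resources and firmware hub ranges from the Windows column).

```python
MIB = 1 << 20
FOUR_GB = 1 << 32


def find_aligned_window(lo, hi, busy, size):
    """Return the lowest `size`-aligned address in [lo, hi) whose whole
    window avoids every (start, end_exclusive) range in `busy`, or None."""
    candidate = (lo + size - 1) & ~(size - 1)
    while candidate + size <= hi:
        if all(candidate + size <= start or end <= candidate
               for start, end in busy):
            return candidate
        candidate += size
    return None


# Coarse model of the Linux 3GB layout (see the comparison table).
BUSY_LINUX_3GB = [
    (0xC0000000, 0xD0000000),   # eight 32MB PCI Express root-port windows
    (0xD0000000, 0xD2000000),   # GFX regions 0 and 3 (16MB each)
    (0xE0000000, 0xF0000000),   # "Motherboard resources"
    (0xFF000000, 0x100000000),  # firmware hub
]

top_of_ram = 0xC0000000  # 3GB installed
print(find_aligned_window(top_of_ram, FOUR_GB, BUSY_LINUX_3GB, 256 * MIB))

# If the root-port windows were elsewhere (as under Windows), the first
# 256MB window above RAM would be usable.
without_root_ports = BUSY_LINUX_3GB[1:]
print(hex(find_aligned_window(top_of_ram, FOUR_GB, without_root_ports,
                              256 * MIB)))
```

In this model the first call finds nothing below 4GB, while the second finds 0xC0000000, which is where Windows places the GFX aperture.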

Examining the Windows allocation addresses with reference to the PCI specifications, it is clear that Windows uses the recommended subtractive decode to allocate ranges (in other words, it starts at the top of the address space and allocates working down). Linux, by contrast, appears to use a bottom-up strategy for every device except one, the "Mobile PCI Bridge [8086:2448]".

Comparison of Linux vs Windows PCI memory mapped allocations

Linux 2GB                 Linux 3GB                   Windows 3GB               
Address                   Address                     Address                   Device                                                                                           
                                                      0xA0000-0xBFFFF           PCI bus                                                                                          
                                                      0xA0000-0xBFFFF           Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 
                                                      0xA0000-0xBFFFF           NVIDIA GeForce Go 7600                                                                           
                                                      0xD0000-0xD3FFF           PCI bus                                                                                          
                                                      0xD4000-0xD7FFF           PCI bus                                                                                          
                                                      0xD8000-0xDBFFF           PCI bus                                                                                          
                                                      0xC0000000-0xFEBFFFFF     PCI bus                                                                                          
0xb0000000-bfffffff       0x100000000-10fffffff       0xC0000000-0xFEBFFFFF     Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port -
0xb0000000                0x100000000                 0xC0000000-0xFEBFFFFF     NVIDIA GeForce Go 7600                                                                           
                                                      0xFED40000-0xFED44FFF     PCI bus                                                                                          
0xd0000000-d1ffffff       0xd0000000-d1ffffff         0xD0000000-0xD1FFFFFF     Mobile Intel(R) 945GM/GU/PM/GMS/940GML/943GML and Intel(R) 945GT Express PCI Express Root Port - 
0xd0000000                0xd0000000                  0xD0000000-0xD1FFFFFF     NVIDIA GeForce Go 7600                                                                           
0xd1000000                0xd1000000                  0xD1000000-0xD1FFFFFF     NVIDIA GeForce Go 7600                                                                           
0xd2300000                0xd2300000                  0xD2300000-0xD2303FFF     High Definition Audio Controller                                                                 
0xc8000000-c9ffffff       0xc8000000-c9ffffff         0xFCC00000-0xFEBFFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D0
0xc0000000-c1ffffff       0xc0000000-c1ffffff         0xF4800000-0xF67FFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D0
0xca000000-cbffffff       0xca000000-cbffffff         0xFAC00000-0xFCBFFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D2
0xc2000000-c3ffffff       0xc2000000-c3ffffff         0xF2800000-0xF47FFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D2
0xcc000000-cdffffff       0xcc000000-cdffffff         0xF8C00000-0xFABFFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D4
0xc4000000-c5ffffff       0xc4000000-c5ffffff         0xF0800000-0xF27FFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D4
0xcc000000                0xcc000000                  0xFABFF000-0xFABFFFFF     Intel(R) PRO/Wireless 3945ABG Network Connection
0xce000000-cfffffff       0xce000000-cfffffff         0xF6C00000-0xF8BFFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D6
0xc6000000-c7ffffff       0xc6000000-c7ffffff         0xDE000000-0xDFFFFFFF     Intel(R) 82801G (ICH7 Family) PCI Express Root Port - 27D6
0xd2304000                0xd2304000                  0xD2304000-0xD23043FF     Intel(R) 82801G (ICH7 Family) USB2 Enhanced Host Controller - 27CC                               
0xd2000000-d20fffff       0xd2000000-d20fffff         0xF6800000-0xF6BFFFFF     Intel(R) 82801 PCI Bridge - 2448                                                                 
0x8c000000-8ffff000       0xd4000000-d7fff000         0xF6BFD000-0xF6BFDFFF     Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller                                          
                                                      0xF6BFE000-0xF6BFEFFF     Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller                                          
0x88000000-8bfff000       0xd8000000-dbfff000         0xF6BFF000-0xF6BFFFFF     Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller                                          
0xd2007000                0xd2007000                  0xFED44000-0xFED44FFF     Texas Instruments PCI-8x12/7x12/6x12 CardBus Controller                                          
0xd2006000                0xd2006000                  0xF6BF6800-0xF6BF6FFF     Texas Instruments OHCI Compliant IEEE 1394 Host Controller                                       
0xd2000000                0xd2000000                  0xF6BF8000-0xF6BFBFFF     Texas Instruments OHCI Compliant IEEE 1394 Host Controller                                       
0xd2004000                0xd2004000                  0xF6BF7000-0xF6BF7FFF     Texas Instruments PCIxx12 Integrated FlashMedia Controller                                       
0xd2005000                0xd2005000                  0xF6BFC000-0xF6BFCFFF     Intel(R) PRO/100 VE Network Connection                                                           
                                                      0xFF000000-0xFFFFFFFF     Intel(R) 82802 Firmware Hub Device                                                               
                                                      0xFED00000-0xFED003FF     High Precision Event Timer                                                                       
0xd2304400                0xd2304400                  0xD2304400-0xD23047FF     Intel(R) 82801GBM/GHM (ICH7-M Family) Serial ATA Storage Controller - 27C4                       
                                                      0xE0000000-0xEFFFFFFF     Motherboard resources                                                                            
                                                      0xFED14000-0xFED17FFF     Motherboard resources                                                                            
                                                      0xFED18000-0xFED18FFF     Motherboard resources                                                                            
                                                      0xFED19000-0xFED19FFF     Motherboard resources                                                                            
                                                      0xFED1C000-0xFED1FFFF     Motherboard resources                                                                            
                                                      0xFED20000-0xFED3FFFF     Motherboard resources                                                                            
                                                      0xFED45000-0xFED8FFFF     Motherboard resources                                                                            

Cause and Potential Solution

After a few debug printk() runs watching the allocation strategy I wondered why the PCI resources region doesn't start at the beginning of the largest gap:

[    0.000000] Allocating PCI resources starting at c2000000 (gap:c0000000:20000000)

since, when 3GB RAM is installed, the gap starts at 0xC0000000 but the allocation region begins at 0xC2000000.

The other issue is that there are several gaps in the top 1GB range but only the largest gap seems to be used for PCI IOMEM allocations, which explains why smaller allocations for other devices effectively choke off use of the range in 32-bit address space.

In contrast, from looking at the addresses in the allocation comparison with Windows, it looks as if Windows uses *all* gaps for allocation rather than just the largest. It is noticeable that Windows allocates smaller regions in the gaps between the various 'high' e820 reservations.
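
The difference between "the largest gap" and "all gaps" can be sketched as follows. The memory map below is hypothetical: it is loosely modelled on the "gap:c0000000:20000000" log line and on the Windows column of the comparison table, not on this laptop's real e820 output, and the function name gaps_below_4g is an illustrative assumption.

```python
FOUR_GB = 1 << 32


def gaps_below_4g(e820, floor=0):
    """Return [(start, end_exclusive)] ranges below 4GB that no entry in
    `e820` (a list of (start, end_exclusive) tuples) covers, lowest first."""
    gaps, cursor = [], floor
    for start, end in sorted(e820):
        if start > cursor:
            gaps.append((cursor, min(start, FOUR_GB)))
        cursor = max(cursor, end)
        if cursor >= FOUR_GB:
            break
    if cursor < FOUR_GB:
        gaps.append((cursor, FOUR_GB))
    return gaps


# HYPOTHETICAL map for a 3GB machine (simplified, legacy holes ignored).
E820_3GB = [
    (0x00000000, 0xC0000000),   # RAM
    (0xE0000000, 0xF0000000),   # reserved
    (0xFEC00000, 0x100000000),  # reserved (chipset, firmware)
]

gaps = gaps_below_4g(E820_3GB)
largest = max(gaps, key=lambda gap: gap[1] - gap[0])

for start, end in gaps:
    marker = "  <- largest" if (start, end) == largest else ""
    print(f"gap {start:#x}-{end - 1:#x} ({(end - start) >> 20}MB){marker}")
```

With this map there are two gaps: the 512MB one starting at 0xC0000000 and a smaller one starting at 0xF0000000. An allocator that only considers the largest gap never places anything in the second one.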

In looking for the origins of the gap-rounding code I eventually found commit f0eca9626c6becb6fc56106b2e4287c6c784af3d from 2005-09-09:

[PATCH] Update PCI IOMEM allocation start

    This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
    apply power", which was due to the CardBus IOMEM register region being
    allocated at an address that was actually inside the RAM window that had
    been reserved for video frame-buffers in an UMA setup.

This introduces a simple 'rounding up' algorithm to create a 'gap' between top of system RAM and beginning of PCI IOMEM as a guard against unintentional over-writes.
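
The rounding can be modelled in a few lines. The sketch below is an assumption about the logic in e820_setup_gap() rather than a verbatim copy of the kernel code (start at 1MB, double until the value reaches one sixteenth of the gap size, then round the start address up), so check it against the commit before relying on it. It does reproduce the numbers in the log line above.

```python
def pci_mem_start(gap_start, gap_size):
    """Model of the 'round up' rule: returns where PCI IOMEM allocation
    would begin for a gap starting at `gap_start` of `gap_size` bytes."""
    round_to = 0x100000                    # start with 1MB granularity
    while (gap_size >> 4) > round_to:      # grow to at least 1/16 of the gap
        round_to += round_to
    # Adding `round_to` before masking always moves past gap_start, even
    # when gap_start is already aligned.
    return (gap_start + round_to) & -round_to


# Values from: "Allocating PCI resources starting at c2000000
# (gap:c0000000:20000000)"
print(hex(pci_mem_start(0xC0000000, 0x20000000)))
```

For a 512MB gap the granularity grows to 32MB, so the first 32MB of the gap is skipped and the 256MB-aligned slot at 0xC0000000 is no longer wholly inside the allocation region.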

The algorithm used was suggested in an example by Linus Torvalds with some provisos, but was adopted verbatim in the patch for the Averatec bug. In his email, Linus went on to say:

"The other alternative is to make PCI allocations generally start at the high range of the allowable - judging by the lspci listings I've seem from people under Windows, that seems to be what Windows does, which might be a good idea (ie the closer we match windows allocation patterns, the more likely we're to not hit some unmarked region - because windows testing would have hit it too)."

That comment reflects my findings in dealing with this bug. Looking at the bug, there are four issues:

  1. No 256MB region on a 256MB boundary available for the GFX IOMEM in the single largest PCI IOMEM region.
  2. The first available 256MB region on a 256MB boundary is unusable because pci_mem_start is being 'rounded up' to gap_start + round.
  3. Multiple gaps higher in the address space are left unused whereas Windows uses them for smaller allocations thus keeping the largest gap free for the devices with large requirements.
  4. Resources aren't being allocated top-down (subtractive decode) as recommended in PCI specs and Intel chipset datasheets, and done by Windows.

Intel 945PM PCI Memory Address Range

Mobile Intel 945 Express Chipset Family Datasheet, section 9.2 Main Memory Address Range, figure 13 PCI Memory Address Range

If [3] were implemented in addition to [4], the smaller allocations would be at the top of the 32-bit address space, much like Windows.
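
The combined effect of [3] and [4] can be illustrated with a toy allocator. Everything here is a simplified sketch under stated assumptions: the two gaps and the request list are hypothetical, every request is naturally aligned (real bridge windows have looser alignment rules), and the function names are made up. It is not a proposal for the actual kernel implementation.

```python
MIB = 1 << 20


def bottom_up_single_gap(requests, gap):
    """Bottom-up, naturally aligned placement inside one gap.
    Returns {name: start address, or None if it does not fit}."""
    lo, hi = gap
    placed, cursor = {}, lo
    for name, size in requests:
        start = (cursor + size - 1) & ~(size - 1)     # align up
        if start + size <= hi:
            placed[name] = start
            cursor = start + size
        else:
            placed[name] = None
    return placed


def top_down_all_gaps(requests, gaps):
    """Top-down placement that tries the highest gap first."""
    cursors = {gap: gap[1] for gap in gaps}           # free top of each gap
    placed = {}
    for name, size in requests:
        placed[name] = None
        for gap in sorted(gaps, reverse=True):
            start = (cursors[gap] - size) & ~(size - 1)   # align down
            if start >= gap[0]:
                placed[name] = start
                cursors[gap] = start
                break
    return placed


# HYPOTHETICAL layout: a 256MB gap just above RAM and a smaller high gap.
LOW_GAP = (0xC0000000, 0xD0000000)
HIGH_GAP = (0xF0000000, 0xFEC00000)
REQUESTS = [(f"bridge{i}", 32 * MIB) for i in range(4)] + [("gfx", 256 * MIB)]

for label, result in (
    ("bottom-up, low gap only", bottom_up_single_gap(REQUESTS, LOW_GAP)),
    ("top-down, all gaps", top_down_all_gaps(REQUESTS, [LOW_GAP, HIGH_GAP])),
):
    print(label)
    for name, start in result.items():
        print(f"  {name}: {'no room' if start is None else hex(start)}")
```

In this toy layout the bottom-up allocator spends the low gap on the bridge windows and has no room for the 256MB request, while the top-down allocator puts the bridge windows in the high gap and leaves 0xC0000000 free for the GFX aperture.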

Implementing [3] and [4] together should avoid the need for commit f0eca962 (CardBus IOMEM in shared video RAM space) since the CardBus IOMEM would be in a 'high' gap (as it would be with Windows).

Dropping commit f0eca962 would solve [2] since the GFX could allocate 256MB on the 256MB boundary at 0xC0000000 in the largest gap.

There might be an issue if a system has an undeclared shared video memory region and another PCI device that needs a large allocation, but if that were the case Windows would surely be affected by it too.

Also, Linus' mention of maintaining an unused gap between top-of-RAM and bottom-of-PCI-IOMEM needs to be considered. Would implementation of [2] and [3] negate the need for it? Windows doesn't maintain a similar gap - is there a reason that Linux should?

Proposal for Rewriting Linux PCI allocation

I'm exploring what is required to replace the existing allocation algorithms with something more sophisticated and in line with the chipset recommendations and the practice dictated by Windows. I've moved the documentation for that proposal to PCI Dynamic Resource Allocation Management.

Code References

These are references to source code that I am reviewing in dealing with this bug.

arch/x86/kernel/e820_64.c::e820_setup_gap()

Previously known as:

arch/x86/kernel/e820.c
arch/i386/kernel/e820.c

Inherited from:

arch/x86/kernel/setup.c

Code-path during resource allocation where failure is reported:

arch/x86/pci/i386.c::pcibios_resource_survey()
 arch/x86/pci/i386.c::pcibios_allocate_bus_resources()
  drivers/pci/pci.c::pci_find_parent_resource()
  kernel/resource.c::request_resource()
  // failure reports:

  [    0.346211] PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0

  arch/x86/pci/i386.c::pcibios_allocate_bus_resources() // re-entrant


 arch/x86/pci/i386.c::pcibios_allocate_resources(0)
 arch/x86/pci/i386.c::pcibios_allocate_resources(1)
  include/linux/pci.h::pci_read_config_word()
  drivers/pci/pci.c::pci_find_parent_resource()
  kernel/resource.c::request_resource()
  // failure reports:

  [    0.350302] PCI: Cannot allocate resource region 1 of device 0000:01:00.0

The code-path taken during resource allocation to devices (not the earlier path that fails to allocate to the 64-bit BAR):

drivers/pci/setup-bus.c::pbus_assign_resources_sorted()
 drivers/pci/setup-res.c::pdev_sort_resources()
 drivers/pci/setup-res.c::pci_assign_resource()
  drivers/pci/bus.c::pci_bus_alloc_resource()
   kernel/resource.c::allocate_resource()
    kernel/resource.c::find_resource()
    kernel/resource.c::__request_resource()
