
PCI Dynamic Resource Allocation Management

A proposal for a new PCI resource allocation strategy, supported by modular and extensible code, that at least puts Linux on a par with current versions of Windows and, where possible, provides for non-disruptive extension in the future. It should integrate support for PCI Hot Plug and NUMA functionality. A key part of the proposal is a new Dynamic Resource Allocation Management facility (PCI-DRAM) that allows the system to reassign the resources of bridges and devices - essential for supporting PCI Express Hot Plug (using SHPC).

This proposal grew out of the discovery of a bug in the current Linux PCI resource allocation scheme that caused a PCI device to be mapped above the 4GB boundary when the host chipset only supports a 32-bit external address space. See PCI Memory-Mapped Allocation Algorithm Bug.

As well as PCI, PHP, and NUMA, this proposal could have implications for ACPI if it makes use of newer services such as _DSM (Device Specific Method) for (re)configuration after boot time.


Status

The code implementing PCI-DRAM was essentially completed and tested in May 2008. The kernel header and Kconfig file are attached to this article. The implementation code is currently being tightened up and will be published as soon as that is done. It may be submitted to the linux-pci kernel mailing list with a view to inclusion in the mainline kernel.

Introduction

This is a work in progress in which I identify the requirements and break them down into discrete packages of changes that can be introduced into the kernel without disruption.

Proposal

  1. Introduce a list of gaps in the PCI address space (see the sketch after this list).
    1. Linked list that tracks gap elements.
    2. Add an experimental kernel config and boot option to enable using all gaps, rather than just the largest.
    3. Each gap should keep track of how much unallocated space it has, and of the largest contiguous unallocated block.
    4. pci_mem_start will need replacing - what are the implications (especially for architectures other than x86)?
    5. Gap detection should start from 0xFEBFFFFF (just below the I/O APIC region) and proceed downwards towards lower addresses.
  2. Introduce algorithms for allocating top-down or bottom-up within a gap.
    1. Add an experimental kernel config and boot option to choose the algorithm.
  3. Introduce an algorithm for choosing the first gap with unallocated space equal to or larger than the requested size (best-fit).
    1. Add an experimental kernel config and boot option to choose gaps on a best-fit basis.
  4. Investigate whether special attention needs to be given to sub-architecture and chipset capabilities.
    1. Identify a 32-bit northbridge in 64-bit CPU systems.
    2. Identify 64-bit BAR capability support in the chipset and devices.
    3. Ensure 32-bit sub-architectures (or 64-bit sub-architectures with a 32-bit northbridge) don't use the 64-bit BAR capability.
    4. Consider what impact this change will have on non-x86 architectures.
  5. Provide functionality for Dynamic Resource Adjustment (PCI-DRA).
    1. Allow a bridge to expand or reduce its resources based on the devices behind it.
    2. Allow Linux to change the boot-time configuration established by the firmware.
    3. Do drivers currently support the notion of 'pausing' to allow device reconfiguration? Could suspend/resume semantics be used?
  6. Investigate what impact this change will have on PHP.
    1. If providing PCI-DRA, what linkage is required between PHP and DRA (notification callbacks, ops structures, etc.)?
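
To make items 1-3 concrete, here is a minimal sketch of what a gap descriptor and a best-fit, top-down selection routine could look like. The names are hypothetical and are not taken from the attached pci-dram.h header; the sketch assumes the gap list is kept sorted by size and that align is a power of two.

#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical gap descriptor: one entry per usable hole in the
 * physical address map, derived from the e820/EFI/ACPI views. */
struct pci_iomem_gap {
        struct list_head node;          /* linkage in the global gap list */
        resource_size_t start;          /* first usable address */
        resource_size_t end;            /* last usable address (inclusive) */
        resource_size_t free;           /* total unallocated bytes */
        resource_size_t largest_free;   /* largest contiguous unallocated block */
        bool enabled;                   /* gaps can be disabled by config or quirk */
};

static LIST_HEAD(pci_iomem_gaps);       /* assumed sorted, smallest gap first */

/*
 * Best-fit selection as described above: return the first enabled gap
 * whose largest contiguous free block can satisfy the request.  With
 * top-down allocation the caller then places the BAR at the highest
 * suitably aligned address inside the chosen gap.
 */
static struct pci_iomem_gap *pci_iomem_find_gap(resource_size_t size,
                                                resource_size_t align)
{
        struct pci_iomem_gap *gap;

        list_for_each_entry(gap, &pci_iomem_gaps, node) {
                resource_size_t base;

                if (!gap->enabled || gap->largest_free < size)
                        continue;

                /* highest aligned base that still leaves room for 'size' */
                base = (gap->end - size + 1) & ~(align - 1);
                if (base >= gap->start)
                        return gap;
        }
        return NULL;    /* nothing suitable below the 4GB boundary */
}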

Memory Address Space Organisation

Intel 945PM PCI Memory Address Range (see the Mobile Intel 945 Express Chipset Family Datasheet, section 9.2 "Main Memory Address Range", Figure 13 "PCI Memory Address Range").

Specifics

Prototyping

I am now actively programming the new system. I've started out with the easy bit: defining the configuration handling, flags, and helper functions for discovering the available regions, sorting them, and otherwise processing the region list. It takes into account the BIOS 0xe820, EFI GetMemoryMap(), and ACPI views of memory. I'm providing a facility for other parts of the kernel to register inspectors that are called once the region list is populated; they can report conflicts and request changes to the list (e.g. disable a region, resize it, split it, or delete it). This provides an easy way to deal with quirks by having quirk-aware drivers spot and report issues before they can affect the system. Every region can be disabled, the number of regions used for allocation can be set, and each region keeps track of how much unused space remains.
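
To illustrate the inspector idea, a minimal sketch of what such a hook could look like is shown below. The names are hypothetical (this is not the interface from the attached header), and the gap structure is the one sketched earlier in the Proposal section.

/* Hypothetical actions an inspector may request for a region. */
enum iomem_region_action {
        IOMEM_REGION_KEEP,      /* leave the region alone */
        IOMEM_REGION_DISABLE,   /* exclude it from allocation */
        IOMEM_REGION_RESIZE,    /* shrink or grow to *new_start..*new_end */
        IOMEM_REGION_SPLIT,     /* split it around a conflicting range */
        IOMEM_REGION_DELETE,    /* drop it from the list entirely */
};

/*
 * Inspectors are called for every region once the list has been
 * populated from the e820/EFI/ACPI sources, so quirk-aware code can
 * veto or adjust entries before they are used for allocation.
 */
struct iomem_region_inspector {
        struct list_head node;
        const char *name;
        enum iomem_region_action (*inspect)(const struct pci_iomem_gap *gap,
                                            resource_size_t *new_start,
                                            resource_size_t *new_end);
};

int iomem_register_inspector(struct iomem_region_inspector *insp);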

Everything is configurable both at compile-time and on the kernel command line - enabling easy experimentation. A policy can be chosen (Legacy, New Linux, Win2003, Win2008) or individual flags can be set that give fine-grained control of the IOMEM allocation system. The code is architecture agnostic so there doesn't need to be a split between 32- and 64-bit architectures.
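
As a rough illustration of the command-line side, a policy option could be wired up with the kernel's standard __setup() mechanism along the following lines. The option name, enum, and values here are hypothetical; the real names are defined in the attached Kconfig.pci-dram and header.

#include <linux/init.h>
#include <linux/string.h>

/* Hypothetical allocation policies selectable at boot time. */
enum iomem_alloc_policy {
        IOMEM_POLICY_LEGACY,    /* single largest gap, bottom-up (current Linux) */
        IOMEM_POLICY_NEWLINUX,  /* all gaps, individually configured flags */
        IOMEM_POLICY_WIN2003,   /* mimic Windows XP / Server 2003 behaviour */
        IOMEM_POLICY_WIN2008,   /* mimic Windows Vista / Server 2008 behaviour */
};

static enum iomem_alloc_policy iomem_policy = IOMEM_POLICY_LEGACY;

/* e.g. booting with "iomem_policy=win2008" selects all-gaps, top-down */
static int __init iomem_policy_setup(char *str)
{
        if (!strcmp(str, "legacy"))
                iomem_policy = IOMEM_POLICY_LEGACY;
        else if (!strcmp(str, "newlinux"))
                iomem_policy = IOMEM_POLICY_NEWLINUX;
        else if (!strcmp(str, "win2003"))
                iomem_policy = IOMEM_POLICY_WIN2003;
        else if (!strcmp(str, "win2008"))
                iomem_policy = IOMEM_POLICY_WIN2008;
        return 1;
}
__setup("iomem_policy=", iomem_policy_setup);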

It is preferable to build the list of available regions as late as possible in the start-up process. As the IOMEM regions are specific to PCI, it would make sense to do this during PCI initialisation rather than very early in the initialisation process, as is the case now (when the e820 processing is done). At the moment it can be plugged in anywhere, and that very much describes my approach: I'm making it pluggable so it is easy to enable or disable without invasive patching.

My aims are:

  1. Successfully detect and build the IOMEM regions list and report it during boot.
  2. Choose the largest region for allocations and assign its start to pci_mem_start aka PCIBIOS_MIN_MEM so the current allocation routines are still used (see the sketch after this list).
  3. Introduce the top-down allocation system within the single region.
  4. Introduce the multi-region system with configurable allocation algorithms.
  5. Demonstrate successful operation and ability to mimic the behaviour and allocations of Windows.
  6. Introduce the DRA code and experimental drivers to test it.
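
Aim 2, for instance, amounts to little more than the following. pci_iomem_largest_gap() is a hypothetical helper; pci_mem_start is the existing x86 global that PCIBIOS_MIN_MEM resolves to.

/* Keep the existing allocator working: point it at the largest gap. */
struct pci_iomem_gap *largest = pci_iomem_largest_gap();

if (largest)
        pci_mem_start = largest->start; /* PCIBIOS_MIN_MEM resolves to this on x86 */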

Current Code

See Linux Kernel Boot Process for a diagram of the call-path that affects the discovery of available regions.

This is a copy of an email submitted to the linux-pci mailing list to try to clarify differences between the 32- and 64-bit implementations. In summary, the answer is that it doesn't really matter.

Why such a big difference in init-time PCI resource call-paths (x86 vs x86_64)

In preparation for writing a Windows-style PCI resource allocation strategy (use all e820 gaps for IOMEM resources, with top-down allocation), and thus giving devices with large IOMEM requirements a better chance of being allocated in the 32-bit address space below 4GB (see bugzilla #10461), I'm mapping out the current implementation to ensure I understand the implications and guard against breaking any quirks or undocumented features.

There are significant differences between the x86 and x86_64 implementations, and I've not been able to find any explanation or deduce a reason for them.

The first difference is mainly cosmetic. Apart from one minor variation, the source code of:

arch/x86/kernel/e820_32.c::e820_register_memory()
arch/x86/kernel/e820_64.c::e820_setup_gap()

is identical. The purpose is to find the single largest gap in the e820 map and point pci_mem_start to the start of the region (with a rounding-up adjustment).

Why do these functions have such differing names? Which name is preferable?

The second difference appears more profound. In the x86_64 call path, the code that reserves resources and requests the ioport resource is called in:

arch/x86/kernel/setup_64.c::setup_arch()

immediately *before* the call to e820_setup_gap():

        e820_reserve_resources();
        e820_mark_nosave_regions();

        /* request I/O space for devices used on all i[345]86 PCs */
        for (i = 0; i < ARRAY_SIZE(standard_io_resources); i++)
                request_resource(&ioport_resource, &standard_io_resources[i]);

        e820_setup_gap();

On x86 however:

        e820_register_memory();
        e820_mark_nosave_regions();

Although e820_register_memory() and e820_mark_nosave_regions() are called at basically the same point in:

arch/x86/kernel/setup_32.c::setup_arch()

as the 64-bit setup_arch() call to e820_setup_gap(), the equivalent of the 64-bit e820_reserve_resources():

arch/x86/kernel/e820_32.c::init_iomem_resources()

and the loop calling request_resource() are in:

arch/x86/kernel/setup_32.c::request_standard_resources()

which is declared as:

subsys_initcall(request_standard_resources);

Now, unless my call-path tracing has gone way wrong, this initcall won't happen until *much* later after a *lot* of other 'stuff' has been done. If my understanding of the initcall mechanism is correct, it finally gets called in:

init/main.c::do_initcalls()

as part of the generic multi-level device init call-by-address functionality, where each function pointer is placed in a .initcallX.init section under a symbol of the form __initcall_functionX (where X is the level id).

So, why the big difference in implementations? What are the implications of each? Is one preferable to the other?
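
For reference, the gap-finding logic that both functions implement boils down to something like the following simplified sketch. This is not the verbatim kernel code; in particular, the real implementation chooses the rounding granularity based on the gap size rather than always rounding to 1MB.

#include <asm/e820.h>

/* Find the largest hole below 4GB in the (address-sorted) e820 map and
 * return a rounded-up start address suitable for pci_mem_start. */
static unsigned long find_largest_e820_gap(void)
{
        unsigned long long last = 0x100000000ULL;       /* walk down from 4GB */
        unsigned long long gap_start = 0, gap_size = 0;
        int i;

        for (i = e820.nr_map - 1; i >= 0; i--) {
                unsigned long long start = e820.map[i].addr;
                unsigned long long end = start + e820.map[i].size;

                /* the hole between this entry and the one above it */
                if (last > end && last - end > gap_size) {
                        gap_size = last - end;
                        gap_start = end;
                }
                if (start < last)
                        last = start;
        }

        /* round up to a 1MB boundary (simplification of the real rounding) */
        return (gap_start + 0xfffff) & ~0xfffffUL;
}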

Gap Management

Identifying PCI Express MMCFG

The gap detection logic must step over the PCI Express memory-mapped configuration space (MMCFG). It often sits at 0xE0000000-0xEFFFFFFF, but its location should be taken from the ACPI MCFG table; it is also commonly reserved by a motherboard resources device under \_SB (a {node} with a _HID of EISAID("PNP0C02")) and may additionally be marked reserved in the BIOS e820 map (optional - not required if the ACPI MCFG table is present).
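
As a sketch of how the gap builder could step over that window once its base and size are known from the parsed MCFG table, using the hypothetical gap structure from the Proposal section and a hypothetical split_gap() helper:

/*
 * Punch an exclusion window (e.g. the MMCFG range) out of a candidate
 * gap.  If the window falls in the middle, the gap is split in two.
 */
static void gap_exclude_range(struct pci_iomem_gap *gap,
                              resource_size_t excl_start,
                              resource_size_t excl_end)
{
        if (excl_end < gap->start || excl_start > gap->end)
                return;                         /* no overlap */

        if (excl_start <= gap->start && excl_end >= gap->end) {
                gap->enabled = false;           /* gap entirely covered */
        } else if (excl_start <= gap->start) {
                gap->start = excl_end + 1;      /* trim the bottom */
        } else if (excl_end >= gap->end) {
                gap->end = excl_start - 1;      /* trim the top */
        } else {
                /* window in the middle: create a new gap for the upper
                 * part before shrinking this one to the lower part */
                split_gap(gap, excl_end + 1);
                gap->end = excl_start - 1;
        }
}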

Standards

  • PCI Local Bus Specification, version 3.0
  • PCI Express® Base 2.0 Specification
  • PCI Firmware 3.0 Specification
  • PCI Hot-Plug 1.1
  • Desktop Management Interface (DMI) standards

Terms

PCI: Peripheral Component Interconnect
PCIe: PCI Express
SHPC: Standard Hot Plug Controller
PHP: PCI Hot Plug
NUMA: Non-Uniform Memory Access
BAR: Base Address Register
DRA: Dynamic Resource Allocation
MCFG: Memory-Mapped PCI Configuration Space table
ACPI: Advanced Configuration and Power Interface

Windows Techniques

These dictate much of the firmware and device functionality Linux has to deal with. Working with the flow seems sensible.

  • Windows Vista supports 64-bit memory base address registers (BARs) on PCI devices and configures them in the address space above the 4‑GB boundary.
  • Windows Vista also supports multilevel rebalance, which allows PCI bridge windows to be dynamically sized based on the resource requirements of the devices behind them.
  • On IA-PC systems, the system address map is conveyed to the operating system through the INT 15h E820h Query System Address Map interface, whereas EFI-enabled systems use the EFI GetMemoryMap() boot services function.

System Address Space and Resource Arbitration in Windows

A 32-bit processor that supports Intel physical address extensions (PAE) can provide between 36 and 40 bits of physical addressability. PAE is an Intel-provided memory address extension that enables support of greater than 4 GB of physical memory for most 32-bit (IA-32) Intel Pentium Pro and later processors. By default, 32-bit versions of the Windows operating systems can address only up to 4 GB of address space. With PAE enabled on capable versions of Windows, the operating system can address up to 37 bits of physical address space, or up to 128 GB. Therefore, a firmware developer must lay out the system address space while remembering the different versions of the Windows operating systems that can be installed on a system.

Constraints on System Address Space

A reasonable division of address space for all resources below the 4GB line is also required to support 32-bit and 64-bit architectures of Windows. Therefore, it is not possible to have all PCI device resources in the address space below 4GB.

PCI Resource Arbitration in Windows

Windows operating systems configure all PCI bridges and devices in a single pass. This means that the operating system enumerates, configures, and starts a PCI bridge before scanning the secondary side of the bus for PCI devices behind the bridge. All PCI devices on the bridge are arbitrated with resources that fall inside the bridge resource windows.

Bridge Window Configuration in Windows XP and Windows Server™ 2003

Microsoft Windows XP and Microsoft Windows Server™ 2003 do not reconfigure the bridge windows based on the requirements of a device behind the bridge. This leads to a classic problem where a PCI device cannot be started due to lack of resources on the bridge, even though enough device resources are available to the system. For reasons such as this, a platform configuration that configures PCI devices at boot time works best for Windows XP and Windows Server 2003.

Multilevel Resource Rebalance in Windows Vista

Windows Vista implements a feature called multilevel rebalance. This resource assignment technique allows the operating system to dynamically reconfigure resource assignments across multiple hierarchical levels in a device tree. In Windows Vista, if a PCI device’s resource requirement cannot be arbitrated inside the current bridge resource window, the operating system reconfigures the PCI bridge with a new set of resources to accommodate the PCI device requirements. Because Windows Vista with multilevel rebalance is better at arbitrating PCI resources, a platform configuration that avoids boot configuration of PCI devices works best for Windows Vista.

  • Windows Driver Model (WDM) uses stop semantics. The Stop request is an operational pause in which the driver takes the necessary steps to suspend operation until the assignment of new resources is complete.

Boot Configuration of PCI Devices

All recent versions of Windows operating systems, including Windows XP, Windows Server 2003, and Windows Vista, respect and attempt to preserve the boot configuration of PCI devices. If it is impossible to do so, the operating system chooses an acceptable location for the device.

Devices above 4 GB in Windows XP and Windows Server 2003

Devices that are configured to boot with a resource above the 4 GB boundary are handled differently on different versions of the Windows operating system. On Windows XP and Windows Server 2003, the device is assigned a resource from a region below the 4 GB boundary, effectively ignoring the boot configuration. If the device's resources cannot be allocated below the 4 GB boundary, the device is assigned a range above 4 GB, irrespective of the processor's addressing capability or the version of Windows that is running. This configuration leaves the device in an inoperable state on 32-bit versions of Windows XP and Windows Server 2003.

Devices above 4 GB in Windows Vista

Windows Vista always respects the boot configuration of devices above 4 GB, considering the processor’s addressing capability and the version of the Windows operating system that is running.

Microsoft recommends the following firmware implementation guidelines:

  • Reserve nonconflicting resources above 4 GB in the _CRS method of a PCI root bus.
  • Use a QWORD memory descriptor in the _CRS method of a PCI root bus to define a memory range. This range is then available as a PCI device memory resource to the entire hierarchy that emanates from the root bus.

Windows XP and Windows Server 2003 effectively ignore this range, whereas Windows Vista uses this range if the processor and operating system version allow it.

  • Assign boot configurations for PCI devices below 4 GB to provide compatibility with Windows XP and Windows Server 2003.
  • Implement the _DSM method to allow Windows Vista to ignore PCI device boot configurations, as described under Windows Vista _DSM Implementation Details below. This ensures the most flexible resource allocation on Windows Vista.

Windows Vista _DSM Implementation Details

If the device supports 64-bit prefetchable memory BARs, Windows Vista attempts to assign a region above 4 GB. When the method is defined in the scope of a PCI bridge, Windows Vista ignores the boot configuration for the entire device path emanating from that bridge. For the bridge and the devices below it to be assigned a region above 4 GB, all devices in the path must support 64-bit prefetchable BARs. If this is not true, the rebalance code runs and moves all resource assignments below 4 GB, because the goal is to start as many devices as possible.


References

  • ACPI Specification v3.0a, section 15.3.2, BIOS Initialisation of Memory
  • Intel® Developer Network for PCI Express* Architecture
  • Microsoft, PCI Express Technologies
  • Firmware Allocation of PCI Device Resources in Windows
  • PCI Multi-Level Rebalance in Windows Vista
  • Microsoft Windows, PCI Bus, device driver kit
  • Microsoft Windows, PCI/PCI Express Compliance Test (PCITEST2_1)
  • Where can I find programming info on PCI?
  • Linux Support for NUMA Hardware
  • Linux Hot Plugging

Resources

Local repository of Industry Standard Specifications

Attachments

  • pci-dram.h (11.6 KB) - Kernel header for PCI Dynamic Resource Allocation Management
  • Kconfig.pci-dram (8.3 KB) - Kernel CONFIG file for PCI Dynamic Resource Allocation Management