The Curse of ASPM

You either get to C8 or you die trying.

z8

2023-11-02

For the past couple weeks I’ve been trying to get to the bottom of a problem that’s very near and dear to my heart: my home server’s power consumption.

This isn’t going to be some sort of tutorial for fixing a server, but maybe it’ll give you ideas as to what to look at when you’re trying to get to the bottom of an issue like this. I’m not exactly new to Linux, but I have never really messed with PCIe or UEFI like this before. You’ve been warned. Before you send me angry emails going “but everyone already knows all of this, and those who don’t just need to do some googling”: well, while that might be true, I would like to preserve the sanity of whoever finds themselves in a situation like mine, because I certainly wanted to stab myself multiple times while writing this post.

TL;DR: It took a new HBA, a new 10G card, a UEFI tool, a script, powertop and loads of mental duct tape to finally resolve all of this. Well, I’m not entirely sure if “resolved” is the right word to describe it.

The problem at hand

When I set out to move away from unRAID to TrueNAS Scale and ZFS, I intentionally picked very low power components in an effort to try and save power. Here’s a parts list:

Aside from having six HDDs, nothing immediately stands out to me as being very power hungry. But still, this machine sits in my closet consuming about 90W at idle. Turning on HDD spin down brings that number down to 58W. Still nowhere close to what you’d expect from a system with a tiny i3 and no dedicated GPU. Time to investigate.

Establishing a baseline

I didn’t do my testing on TrueNAS Scale, but instead used a new Ubuntu install on an MX500 250GB SSD. It’s much easier to just reinstall that in case I completely break something. The kind of thing you’ve got to account for when you’re just throwing stuff at the wall and seeing what sticks. Also Ubuntu was the OS that the folks over on the unRAID forums suggested for troubleshooting, because apparently “stuff just works there.”

Question #1: When I take this exact same system, don’t unplug anything, and just swap TrueNAS Scale for Ubuntu, do I get the same numbers?

Answer: Not quite. The power meter says 85W at idle. However, I think we can all agree that this is far too much for this tiny system.

What does powertop say?

The attentive reader might notice that the CPU package never enters C2 or any of the deeper package states. This is really bad. Meanwhile all the cores are chilling at C7.

Just for fun I ran sudo powertop --auto-tune. It lowered the power consumption by maybe 1W at most but also enabled auto-suspend for all of my USB devices so now my mouse and keyboard don’t work quite like they should. What a terrible idea lmao

Now I finally get to unplug stuff.

Out go the SAS Expander, 10G card and the HBA.

With the BIOS reset to defaults and nothing plugged into the computer, it pulls 28W from the wall.

What does powertop say now?

Still bad. Time to change some BIOS settings.

Settings → Platform Power:
    Platform Power Management: Enabled
    PEG ASPM:                  Enabled
    PCH ASPM:                  Enabled
    DMI ASPM:                  Enabled

Result: 27W. Ouch. Okay, not all hope is lost. There are still more settings I can tweak.

Tweaker → Advanced CPU Settings:
    CPU EIST Function:               Enabled
    Intel(R) Turbo Boost Technology: Disabled
    C-States Control:                Enabled
    CPU Enhanced Halt (C1E):         Enabled
    C3 State Support:                Enabled
    C6/7 State Support:              Enabled
    C8 State Support:                Enabled
    C10 State Support:               Enabled
    Package C State Limit:           C10

Settings → IO Ports:
    Audio Controller:       Disabled
    OnBoard LAN Controller: Disabled

18 watts. As you can see, I disabled the on-board audio and network controllers. This board has 2.5Gb Ethernet, but I have a 10Gb PCIe card and switch, so it is of no use to me. And it’s a server, so why bother with audio?

powertop reports the following:

We’re in C2, that’s progress. Come on, powertop --auto-tune, I know you can do it…

And we’re in C8! Yes!

The power meter reads 16W. I think this may just be the absolute minimum as far as power consumption goes.

A look at lspci reveals that every PCI device currently running has ASPM enabled.

user@user-Z590-D:~$ sudo lspci -vvv | grep "ASPM .*abled"
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled+ CommClk+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-

user@user-Z590-D:~$ sudo lspci -vvv | grep "ASPM"
LnkCap: Port #17, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCap: Port #21, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCap: Port #1, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCap: Port #4, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled+ CommClk+
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
LnkCap: Port #5, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCap: Port #9, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCap: Port #13, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-

Amazing! So now we know that the system is indeed capable of running without drawing too much power on its own. Certainly a step in the right direction.
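One downside of the grep above is that it throws away the device addresses, so you can't tell which line belongs to which device. Here is a minimal Python sketch that keeps them, assuming the usual `lspci -vvv` layout (device header at column 0, indented LnkCtl line underneath). The function name and the sample text are made up for illustration.

```python
import re

# A device header starts at column 0 with bus:device.function,
# optionally prefixed by a PCI domain (e.g. "0000:05:00.0").
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# The link control line, e.g. "LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, ...".
LNKCTL_RE = re.compile(r"LnkCtl:\s+ASPM\s+([^;]+);")


def aspm_by_device(lspci_output):
    """Map each PCI address to the ASPM part of its LnkCtl line."""
    result = {}
    current = None
    for line in lspci_output.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        lnkctl = LNKCTL_RE.search(line)
        if lnkctl and current is not None:
            result[current] = lnkctl.group(1).strip()
    return result


# Hypothetical, heavily shortened sample in the shape lspci prints.
SAMPLE = """\
00:1b.0 PCI bridge: Example Root Port
\tLnkCap: Port #17, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
\tLnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
05:00.0 Ethernet controller: Example NIC
\tLnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
"""

for address, state in aspm_by_device(SAMPLE).items():
    print(address, state)
```

On a real machine you would feed it the text of `sudo lspci -vvv` instead of the sample string.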

Finding the culprit(s)

Mellanox ConnectX-3

Time to plug stuff back in. With the Mellanox ConnectX-3 back in my system, power usage shoots back up to 25W. powertop shows that the system is stuck at C2. Running powertop --auto-tune gets the system down to C3 and 23-24W. Something’s definitely not right.

What does lspci say this time?

user@user-Z590-D:~$ sudo lspci -vvv | grep "ASPM .*abled"
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled+ CommClk+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+

Well well well… What device could that possibly be?

user@user-Z590-D:~$ sudo lspci -vvv -s 05:00.00
05:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies MT27500 Family [ConnectX-3]
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at 52100000 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at 50000000 (64-bit, prefetchable) [size=8M]
    Expansion ROM at 52000000 [disabled] [size=1M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] Vital Product Data
        Product Name: CX311A - ConnectX-3 SFP+
        Read-only fields:
            [PN] Part number: MCX311A-XCAT_A       
            [EC] Engineering changes: A7
            [SN] Serial number: MT1621X14309            
            [V0] Vendor specific: PCIe Gen3 x4
            [RV] Reserved: checksum good, 0 byte(s) reserved
        Read/write fields:
            [V1] Vendor specific: N/A   
            [YA] Asset tag: N/A                     
            [RW] Read-write area: 109 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 253 byte(s) free
            [RW] Read-write area: 252 byte(s) free
        End
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 116.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [148 v1] Device Serial Number 24-8a-07-03-00-5e-31-70
    Capabilities: [154 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [18c v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core

And what’s the second device in that list? Let’s have a look and see what the Mellanox card is plugged into:

user@user-Z590-D:~$ sudo lspci -t
-[0000:00]-+-00.0
           +-02.0
           +-14.0
           +-14.2
           +-16.0
           +-17.0
           +-1b.0-[01]--
           +-1b.4-[02]--
           +-1c.0-[03]--
           +-1c.3-[04]--
           +-1c.4-[05]----00.0
           +-1d.0-[06]--
           +-1d.4-[07]--
           +-1f.0
           +-1f.4
           \-1f.5
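The same parent lookup can be done without reading the tree by eye. On Linux, `/sys/bus/pci/devices/<address>` is a symlink whose resolved path nests each device under its upstream bridge. The sketch below pulls the parent out of such a path. The helper names are hypothetical, and the example path is what I would expect for this tree, not something copied from the machine.

```python
import os
import re

PCI_ADDRESS_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def upstream_port(resolved_sysfs_path):
    """Return the PCI address of the parent bridge, or None for a root-bus device."""
    parent_dir = os.path.dirname(resolved_sysfs_path.rstrip("/"))
    parent = os.path.basename(parent_dir)
    return parent if PCI_ADDRESS_RE.match(parent) else None


def upstream_port_of(address):
    """Resolve the sysfs symlink for an address like "0000:05:00.0" (Linux only)."""
    return upstream_port(os.path.realpath("/sys/bus/pci/devices/" + address))


# Assumed resolved path for the card on bus 05 behind root port 1c.4.
print(upstream_port("/sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0"))
```

For the assumed path this prints `0000:00:1c.4`. A device that sits directly on the root bus has `pci0000:00` as its parent directory, so the function returns `None` for it.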

And looking at 1c.4 gets us:

user@user-Z590-D:~$ sudo lspci -vvv -s 00:1c.4
00:1c.4 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #5 (rev 11) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 124
    Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
    I/O behind bridge: 0000f000-00000fff [disabled]
    Memory behind bridge: 52000000-521fffff [size=2M]
    Prefetchable memory behind bridge: 0000000050000000-00000000507fffff [size=8M]
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0
            ExtTag- RBE+
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 256 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
        LnkCap: Port #5, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
            ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #8, PowerLimit 25.000W; Interlock- NoCompl+
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet- LinkState-
        RootCap: CRSVisible-
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR+
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
             AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd+
             AtomicOpsCtl: ReqEn- EgressBlck-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee20004  Data: 0021
    Capabilities: [90] Subsystem: Gigabyte Technology Co., Ltd Tiger Lake-H PCI Express Root Port
    Capabilities: [a0] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
        RootCmd: CERptEn+ NFERptEn+ FERptEn+
        RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
             FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
        ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
    Capabilities: [220 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [150 v1] Precision Time Measurement
        PTMCap: Requester:- Responder:+ Root:+
        PTMClockGranularity: 4ns
        PTMControl: Enabled:+ RootSelected:+
        PTMEffectiveGranularity: Unknown
    Capabilities: [200 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=40us PortTPowerOnTime=44us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=40us LTR1.2_Threshold=81920ns
        L1SubCtl2: T_PwrOn=44us
    Capabilities: [a30 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [a00 v1] Downstream Port Containment
        DpcCap: INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
        DpcCtl: Trigger:1 Cmpl- INT+ ErrCor- PoisonedTLP- SwTrigger- DL_ActiveErr-
        DpcSta: Trigger- Reason:00 INT- RPBusy- TriggerExt:00 RP PIO ErrPtr:1f
        Source: 0000
    Kernel driver in use: pcieport

So the Mellanox ConnectX-3 has ASPM disabled and as a result of that the PCIe slot it is plugged into also reports that ASPM is disabled.
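It is worth comparing the two LnkCap lines above. As far as I understand the PCIe rules, an ASPM state may only be enabled on a link if both ends advertise it. The following sketch parses the LnkCap lines and intersects the advertised states. The function names are made up, and this is my reading of the output, not something the kernel or lspci reports directly.

```python
import re

# Captures whatever follows "ASPM " on a LnkCap line, up to the next comma.
LNKCAP_RE = re.compile(r"LnkCap:.*?ASPM ([^,;]+)")


def supported_states(lnkcap_line):
    """Return the set of ASPM states ("L0s", "L1") a LnkCap line advertises."""
    match = LNKCAP_RE.search(lnkcap_line)
    if not match:
        return set()
    # Anything else (e.g. "not supported") yields an empty set.
    return {token for token in match.group(1).split() if token in ("L0s", "L1")}


def common_states(upstream_lnkcap, endpoint_lnkcap):
    """States advertised by both ends of the link."""
    return supported_states(upstream_lnkcap) & supported_states(endpoint_lnkcap)


# LnkCap lines copied from the two lspci dumps above.
ROOT_PORT = "LnkCap: Port #5, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us"
CONNECTX3 = "LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited"

print(common_states(ROOT_PORT, CONNECTX3))  # prints set()
```

With the card plugged in, the root port advertises only L1 and the card advertises only L0s, so there is no state both sides agree on.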

Before I go off on a big tangent here, let’s just keep going and try different components.

HP SAS Expander

The SAS Expander is not actually a PCIe device. It merely uses the PCIe slot for power. Just for the sake of being thorough, it draws 9 watts when plugged in.

LSI 9211-8i

With the HBA back in the system, I’m seeing the exact same problem. Power consumption goes up more than it should, the system again can’t seem to go below C3 and lspci this time presents me with the following information:

26W total system power draw without the powertop thingy, 25W with it. We are once again stuck in C3 with nowhere to go. For the record, this card reportedly only uses about 6W.

user@user-Z590-D:~$ sudo lspci -vvv -s 02:00.0
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
    Subsystem: Broadcom / LSI 9210-8i
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: I/O ports at 4000 [size=256]
    Region 1: Memory at 517c0000 (64-bit, non-prefetchable) [size=16K]
    Region 3: Memory at 51380000 (64-bit, non-prefetchable) [size=256K]
    Expansion ROM at 51300000 [disabled] [size=512K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s (ok), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
        Vector table: BAR=1 offset=00002000
        PBA: BAR=1 offset=00003800
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [138 v1] Power Budgeting <?>
    Capabilities: [150 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 1, stride: 1, Device ID: 0072
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 00000000517c4000 (64-bit, non-prefetchable)
        Region 2: Memory at 00000000513c0000 (64-bit, non-prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [190 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Kernel driver in use: mpt3sas
    Kernel modules: mpt3sas

Asking the internet for advice

There are plenty of smart people on the internet, perhaps someone else figured out something I can’t.

Mellanox ConnectX-3

Seems like this card doesn’t support ASPM at all. Well that was a waste of my time.

LSI 9211-8i

Here’s where things get really interesting. The official LSI user guide for the 92xx series HBAs explicitly lists ASPM as a feature, but ASPM is set to disabled. Now why would that be? Turns out that back in 2013 someone filed a bug report stating that their system would constantly lock up during high read workloads. The solution turned out to be to simply disable ASPM. For any of the MPI v2.0 chipsets listed in mpi2_cnfg.h, the driver explicitly disables ASPM.

So unless I am willing to take my chances and live with the possibility of my storage array doing god knows what during rebuilds, I don’t think this is worth pursuing. However, this does not apply to MPI v2.5 and MPI v2.6 devices. While there is also a patch for newer devices running that same mpt2sas/mpt3sas driver, it appears to not have gone anywhere. A shimmer of hope.

ASPM works, except when it doesn’t

Some investigative work appears to be in order. The devices say that they are capable of ASPM at least… First, let’s see if my system even correctly supports/implements ASPM:

root@user-Z590-D:/home/user# dmesg | grep ASPM
[    0.469814] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[   16.221745] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
[   16.253813] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration

Now this is bizarre. One would assume that if they enable ASPM within the BIOS, the operating system would also be informed of that. My first thought after reading that was: “Can ASPM somehow forcibly be enabled?”

Apparently the answer to that question is “yes, sort of, but your mileage may vary”.

The kernel

There is a kernel parameter for it, see here.

With a quick edit to /boot/grub/grub.cfg and the pcie_aspm=force parameter now set, we reboot the system and are surprised to learn that… nothing has changed.

root@user-Z590-D:/home/user# dmesg | grep ASPM
[    0.118848] PCIe ASPM is forcibly enabled
[    0.421773] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[   11.801846] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
[   11.825805] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration

Might as well reverse it again.

setpci and enable_aspm

In the process of figuring out what the hell was happening here I did also stumble across this wiki article and these two sections: “Enabling ASPM with enable_aspm” and “Enabling ASPM with setpci”. To test these methods I used the Mellanox card.

Interestingly enough the script didn’t work properly when I tried it. I don’t even quite get why.

root@user-Z590-D:/home/user/Downloads# ./aspm.sh 
Root complex:
00:1c.4 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #5 (rev 11)
    0x50 : 0x43 --> 0x41 ...    [SUCCESS]

Endpoint:
(standard_in) 1: syntax error
setpci: Unknown register "".
Try `setpci --help' for more information.
setpci: Unknown register "".
Try `setpci --help' for more information.
./aspm.sh: line 174: printf: 0x: invalid hex number

[... this repeats a dozen times ...]

Long loop while looking for ASPM word for 05:00.0

Instead of trying to debug some ancient shell script I might as well just re-write the entire thing. Have a look: GitHub - 0x666690/ASPM/aspm.py
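The core idea of such a script is a walk through the PCI capability list: start at the pointer at 0x34, follow the chain until the PCI Express capability (ID 0x10) shows up, then look at the Link Control register 0x10 bytes into it, whose two lowest bits hold the ASPM setting. What follows is a standalone sketch of that walk on a config-space dump, not the code from the repository. The names for the L0s-only and L1-only states are my own, and the config space is synthetic.

```python
PCI_CAPABILITY_POINTER = 0x34  # offset of the first capability pointer
PCI_CAP_ID_EXPRESS = 0x10      # capability ID of the PCI Express capability
LINK_CONTROL_OFFSET = 0x10     # Link Control register, relative to the capability
ASPM_MASK = 0b11               # ASPM control lives in the two lowest bits

# Names for 0b01 and 0b10 are assumptions chosen to match the other two.
ASPM_NAMES = {0b00: "ASPM_DISABLED", 0b01: "ASPM_L0s",
              0b10: "ASPM_L1", 0b11: "ASPM_L1_AND_L0s"}


def find_link_control(config):
    """Walk the capability list and return the offset of the Link Control register."""
    pointer = config[PCI_CAPABILITY_POINTER] & 0xFC  # low two bits are reserved
    seen = set()
    while pointer and pointer not in seen:  # stop on end of list or a loop
        seen.add(pointer)
        cap_id, next_pointer = config[pointer], config[pointer + 1]
        if cap_id == PCI_CAP_ID_EXPRESS:
            return pointer + LINK_CONTROL_OFFSET
        pointer = next_pointer & 0xFC
    raise LookupError("no PCI Express capability found")


def with_aspm(link_control_byte, aspm_bits):
    """Return the byte with only its ASPM bits replaced."""
    return (link_control_byte & ~ASPM_MASK) | (aspm_bits & ASPM_MASK)


# Synthetic config space: 0x40 (ID 0x01) -> 0x48 (ID 0x03) -> 0x9c (ID 0x11) -> 0x60 (ID 0x10).
config = bytearray(256)
config[0x34] = 0x40
config[0x40:0x42] = bytes([0x01, 0x48])
config[0x48:0x4A] = bytes([0x03, 0x9C])
config[0x9C:0x9E] = bytes([0x11, 0x60])
config[0x60:0x62] = bytes([0x10, 0x00])
config[0x70] = 0x40  # Link Control low byte with ASPM disabled

offset = find_link_control(bytes(config))
print(hex(offset), ASPM_NAMES[config[offset] & ASPM_MASK])
patched = with_aspm(config[offset], 0b11)
print(hex(patched), ASPM_NAMES[patched & ASPM_MASK])
```

On Linux the real bytes can be read from `/sys/bus/pci/devices/<address>/config`. Reading past the first 64 bytes needs root, and actually writing the new value back is what `setpci` is for.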

root@user-Z590-D:/home/user/Documents/GitHub/ASPM# python3 aspm.py 
00:1c.4 PCI bridge: Intel Corporation Tiger Lake-H PCI Express Root Port #5 (rev 11)
0x34 points to 0x40
Value at 0x40 is 0x10
Found the byte at: 0x40
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x50
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x43
-> ASPM_L1_AND_L0s
05:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
0x34 points to 0x40
Value at 0x40 is 0x1
Value is not 0x10!
Reading the next byte...
0x41 points to 0x48
Value at 0x48 is 0x3
Value is not 0x10!
Reading the next byte...
0x49 points to 0x9c
Value at 0x9c is 0x11
Value is not 0x10!
Reading the next byte...
0x9d points to 0x60
Value at 0x60 is 0x10
Found the byte at: 0x60
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x70
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x43
-> ASPM_L1_AND_L0s

If everything works as expected, two things should happen:

Unfortunately only the former is the case.

# Left out all the unimportant bits...
root@user-Z590-D:/home/user# lspci -vvv -s 05:00.0
05:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 116.000W
        LnkCap: Port #8, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

We will revisit this later.

ACPI tables and UEFI

You know what, let’s go back to the beginning. The error message said something about FADT, but what does that even mean? As with most things in life, the internet has the answer.

This is where I have to openly admit that I don’t know a damn thing about ACPI tables. The only ones I am somewhat familiar with are the DSDT and the SSDT, but only because those were the ones giving 15-year-old me trouble getting my Hackintosh to go to sleep. Good times.

The OSDev Wiki doesn’t tell us where exactly the table reveals information about ASPM, so it’s time to read some documentation.

It appears that this is merely done through a single bit in IAPC_BOOT_ARCH. If the bit is set, the firmware is telling the OS that the platform does not support PCIe ASPM and that the OS must not enable it. Now I wonder what that particular bit is set to on my system…
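For the curious, checking that bit by hand is not much work. In the FADT layout as I read the ACPI spec, IAPC_BOOT_ARCH is a two-byte little-endian field at offset 109, and the ASPM flag is bit 4. A minimal sketch, with a made-up function name and a synthetic table instead of a real dump:

```python
import struct

IAPC_BOOT_ARCH_OFFSET = 109   # 0x6D, two bytes, little-endian
PCIE_ASPM_CONTROLS = 1 << 4   # bit 4: set means "OS must not enable ASPM"


def aspm_blocked_by_fadt(fadt):
    """Return True if the FADT tells the OS not to enable PCIe ASPM."""
    if fadt[:4] != b"FACP":
        raise ValueError("not a FADT (signature is not 'FACP')")
    (boot_arch,) = struct.unpack_from("<H", fadt, IAPC_BOOT_ARCH_OFFSET)
    return bool(boot_arch & PCIE_ASPM_CONTROLS)


# Synthetic table: only the signature and the flag field are filled in.
table = bytearray(276)
table[:4] = b"FACP"
struct.pack_into("<H", table, IAPC_BOOT_ARCH_OFFSET, 0x11)  # illustrative value with bit 4 set

print(aspm_blocked_by_fadt(bytes(table)))  # prints True
```

On Linux the raw table is normally readable as root from `/sys/firmware/acpi/tables/FACP`, so the synthetic bytes could be swapped for the contents of that file.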

Time to dump that table:

user@user-Z590-D:~/Desktop$ sudo apt-get install acpica-tools
user@user-Z590-D:~/Desktop$ sudo acpidump -b -n FACP
user@user-Z590-D:~/Desktop$ iasl -d facp.dat 

Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20200925
Copyright (c) 2000 - 2020 Intel Corporation

File appears to be binary: found 237 non-ASCII characters, disassembling
Binary file appears to be a valid ACPI table, disassembling
Input file facp.dat, Length 0x114 (276) bytes
ACPI: FACP 0x0000000000000000 000114 (v06 ALASKA A M I    01072009 AMI  01000013)
Acpi Data Table [FACP] decoded
Formatted output:  facp.dsl - 10157 bytes

Let’s have a look:

               Legacy Devices Supported (V2) : 1
            8042 Present on ports 60/64 (V2) : 0
                        VGA Not Present (V4) : 0
                      MSI Not Supported (V4) : 0
                PCIe ASPM Not Supported (V4) : 1
                   CMOS RTC Not Present (V5) : 0
[06Fh 0111   1]                     Reserved : 00

Sigh… there we have the answer. The thing is though, I’ve seen this board act the right way when nothing was plugged in. Can I just patch this and pretend that everything is okay?

Yes.

Now there are two ways of doing it:

Since I don’t plan on using this Ubuntu install until the end of time and would like to go back to TrueNAS Scale eventually, I think I’m gonna go with the second route. The simplest way of doing things would be to have a USB flash drive, completely separate from the main boot drive, that gets me into a UEFI shell, runs a program that patches the table, and then jumps to the actual bootloader that will load TrueNAS.

I have little to no experience writing UEFI drivers, but luckily there’s already a tool out there for patching a different part of the ACPI tables for me to build upon.

Big big thank you to James Swineson for his work on the S0ixEnabler.

After compiling a binary and putting it on a USB stick, we can see that this indeed patches things up.

Shell> fs0:\ASPMEnabler.efi

ASPMEnabler
https://github.com/0x666690/ASPM
A modified version of: https://github.com/Jamesits/S0ixEnabler
Firmware American Megatrends Rev 327699

Table #1/14: Not RDSP
Table #2/14: Not RDSP
Table #3/14: Not RDSP
Table #4/14: Not RDSP
Table #5/14: Not RDSP
Table #6/14: Not RDSP
Table #7/14: RDSP Rev 0 @0x3B513000 | No XSDT
Table #8/14: RDSP Rev 2 @0x3B513014 | XSDT OEM ID: ALASKA Tables: 26
ACPI table #1/26: FACP Rev 6 OEM ID: ALASKA
Checking initial checksum... OK
Patching FADT table...
FADT::IaPcBootArch before: 0x11
FADT::IaPcBootArch after: 0x1
Checksum before: 0x2
Checksum after: 0x3F
Re-check... OK
FADT table patch finished
ASPMEnabler done
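
What the tool does boils down to two steps: clear bit 4 of IAPC_BOOT_ARCH, then fix up the table checksum so that all bytes sum to zero modulo 256 again. Below is a rough Python sketch of that idea. It is not the ASPMEnabler source (that one is a UEFI application); it works on a plain byte string, the offsets (checksum at byte 9, IAPC_BOOT_ARCH at byte 109) are assumed from the standard ACPI table layout, and the function names and the synthetic table are invented for illustration.

```python
import struct

CHECKSUM_OFFSET = 9            # 1-byte checksum in the common ACPI table header
IAPC_BOOT_ARCH_OFFSET = 109    # 2-byte flags field in the FADT (assumed layout)
ASPM_NOT_SUPPORTED = 1 << 4    # bit 4: "PCIe ASPM Not Supported"


def acpi_checksum_ok(table: bytes) -> bool:
    """An ACPI table is valid when all of its bytes sum to 0 modulo 256."""
    return (sum(table) & 0xFF) == 0


def clear_aspm_flag(fadt: bytes) -> bytes:
    """Return a copy of the FADT with bit 4 cleared and the checksum repaired."""
    if fadt[:4] != b"FACP":
        raise ValueError("not a FADT")
    if not acpi_checksum_ok(fadt):
        raise ValueError("initial checksum is wrong, refusing to patch")
    table = bytearray(fadt)
    (flags,) = struct.unpack_from("<H", table, IAPC_BOOT_ARCH_OFFSET)
    struct.pack_into("<H", table, IAPC_BOOT_ARCH_OFFSET, flags & ~ASPM_NOT_SUPPORTED)
    # Recompute the checksum: zero it, then store whatever makes the sum 0 again.
    table[CHECKSUM_OFFSET] = 0
    table[CHECKSUM_OFFSET] = (-sum(table)) & 0xFF
    return bytes(table)


def make_demo_fadt() -> bytes:
    """Build a synthetic 276-byte FADT with IAPC_BOOT_ARCH = 0x11 (demo only)."""
    table = bytearray(276)
    table[:4] = b"FACP"
    struct.pack_into("<I", table, 4, len(table))
    struct.pack_into("<H", table, IAPC_BOOT_ARCH_OFFSET, 0x0011)
    table[CHECKSUM_OFFSET] = (-sum(table)) & 0xFF
    return bytes(table)


original = make_demo_fadt()
patched = clear_aspm_flag(original)
print(hex(struct.unpack_from("<H", original, IAPC_BOOT_ARCH_OFFSET)[0]))  # 0x11
print(hex(struct.unpack_from("<H", patched, IAPC_BOOT_ARCH_OFFSET)[0]))   # 0x1
print(acpi_checksum_ok(patched))                                           # True
```

The important part is doing both steps together: a table with the flag cleared but a stale checksum would just get rejected.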

Shell> fs1:\EFI\ubuntu\grubx64.efi

No more complaints about the FADT!

root@user-Z590-D:/home/user# dmesg | grep ASPM
[   10.443179] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]

Unfortunately this doesn’t magically fix the fact that both of the devices I have are generally incompatible with ASPM. Oh well.

New hardware

You know what, I’m not taking any more chances. I’ll only buy something once at least one other person has vouched for it.

Mellanox ConnectX-4

I shouldn’t have said that.

I’m not even gonna say much here and let the pictures do the talking.

Didn’t even take a proper screenshot. Just took a photo of my screen, put the card back in its box and shipped it back the day I got it.

Intel X710-DA2

A comment by unRAID forum member h0schi in this thread is what made me buy mine.

One thing to note though: these cards are vendor-locked (in theory). To use them with SFP+ modules that aren’t whitelisted, you’ll need to patch them with this: GitHub - bibigon812/xl710-unlocker

Now, does this one have proper ASPM support?

Hell yes.

We’re down to about 22W without powertop --auto-tune and 19W with it. Now those are finally some numbers that I’m happy with.

root@user-Z590-D:/home/user# lspci -vvv -s 02:00.0
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
    Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at 53800000 (64-bit, prefetchable) [size=8M]
    Region 3: Memory at 54c00000 (64-bit, prefetchable) [size=32K]
    Expansion ROM at 53180000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Address: 0000000000000000  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00001000
    Capabilities: [a0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [e0] Vital Product Data
        Product Name: XL710 40GbE Controller
        Read-only fields:
            [PN] Part number: 
            [EC] Engineering changes: 
            [FG] Unknown: 
            [LC] Unknown: 
            [MN] Manufacture ID: 
            [PG] Unknown: 
            [SN] Serial number: 
            [V0] Vendor specific: 
            [RV] Reserved: checksum good, 0 byte(s) reserved
        Read/write fields:
            [V1] Vendor specific: 
        End
    Capabilities: [100 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [140 v1] Device Serial Number cc-7d-ad-ff-ff-fe-fd-3c
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 16, stride: 1, Device ID: 154c
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 0000000053400000 (64-bit, prefetchable)
        Region 3: Memory at 0000000054c10000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [1a0 v1] Transaction Processing Hints
        Device specific mode supported
        No steering table available
    Capabilities: [1b0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [1d0 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Kernel driver in use: i40e
    Kernel modules: i40e
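
Out of that wall of text, only two lines really matter for ASPM: LnkCap (what the device advertises) and LnkCtl (what is actually switched on). A small Python sketch for fishing them out of lspci -vvv output, assuming the line format shown above; the function name and the dictionary keys are my own invention.

```python
import re


def aspm_summary(lspci_text: str) -> dict:
    """Return what the device advertises (LnkCap) and what is active (LnkCtl)."""
    # "LnkCap: ..., ASPM L1, Exit Latency ..."  -> "L1"
    cap = re.search(r"LnkCap:.*?ASPM ([^,\n]+)", lspci_text)
    # "LnkCtl: ASPM L1 Enabled; RCB 64 bytes"   -> "L1 Enabled"
    ctl = re.search(r"LnkCtl:\s*ASPM ([^;\n]+)", lspci_text)
    return {
        "advertised": cap.group(1).strip() if cap else None,
        "active": ctl.group(1).strip() if ctl else None,
    }


# Lines copied from the X710 dump above.
sample = """\
        LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
"""

summary = aspm_summary(sample)
print(summary)
```

On a live system the text could come from running lspci -vvv -s with the device address as root; without root, lspci typically hides the capability list and both values come back as None.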

Broadcom/LSI 9305-24i

With MPI v2.0 devices off the table and the strong desire to get rid of my SAS expander because it’s nothing but a power hog, I ended up settling for the next best thing that can drive all of my drive bays at once, a 9305-24i. Took three weeks to get here and arrived with the slot cover bent out of shape. Oh well, whatever.

mpt3sas_cm0: LSISAS3224: FWVersion(16.00.12.00), ChipRevision(0x01), BiosVersion(18.00.03.00)

And what does lspci say?

user@user-Z590-D:~$ sudo lspci -vvv -s 02:00.0
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
    Subsystem: Broadcom / LSI SAS9305-24i
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: I/O ports at 4000 [size=256]
    Region 1: Memory at 53100000 (64-bit, non-prefetchable) [size=64K]
    Expansion ROM at 53000000 [disabled] [size=1M]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Address: 0000000000000000  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=1 offset=0000e000
        PBA: BAR=1 offset=0000f000
    Capabilities: [100 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [1e0 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [1c0 v1] Power Budgeting <?>
    Capabilities: [190 v1] Dynamic Power Allocation <?>
    Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Kernel driver in use: mpt3sas
    Kernel modules: mpt3sas

ASPM not supported? Oh you have got to be kidding me… That would really suck. Perhaps the card isn’t running the most recent firmware? Reflashing it certainly can’t hurt. There are plenty of tutorials out there on how to update these cards. Here’s what we need:

Here’s what my USB Stick looks like:

EFI
    BOOTX64.efi (a copy of OpenShell.efi with a different name)
sas3flash.efi
mpt3x64.rom (the UEFI BSD & HII Configuration Utility, from the 'Signed' folder)
SAS9305_24i_IT_P.bin (the firmware itself)

Time to boot to the USB…

> fs0:
> FS0:\> sas3flash.efi -listall
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

SAS3FLASH: Disconnecting the EFI Driver.
Adapter Selected is a Avago SAS: SAS3224(A1)

Num  Ctrl         FW Ver       NVDATA       x86-BIOS     PCI Addr
---------------------------------------------------------------------
0    SAS3224(A1)  16.00.12.00  10.00.00.03  08.37.02.00  00:02:00:00

Finished Processing Commands Successfully.
Exiting SAS3Flash.

SAS3FLASH: Reconnecting the EFI Driver. Please wait...
> FS0:\> sas3flash.efi -list
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

SAS3FLASH: Disconnecting the EFI Driver.
Adapter Selected is a Avago SAS: SAS3224(A1)

Controller Number:           0
Controller:                  SAS3224(A1)
PCI Address:                 00:02:00:00
SAS Address:                 500062B-2-0299-1693
NVDATA Version (Default):    10.00.00.03
NVDATA Version (Persistent): 10.00.00.03
Firmware Product ID:         0x2228 (IT)
Firmware Version:            16.00.12.00
NVDATA Vendor:               LSI
NVDATA Product ID:           SAS9305-24i
BIOS Version:                08.37.02.00
UEFI BSD Version:            18.00.03.00
FCODE Version:               N/A
Board Name:                  SAS9305-24i
Board Assembly:              03-25699-02004
Board Tracer Number:         XW84190440

Finished Processing Commands Successfully.
Exiting SAS3Flash.

SAS3FLASH: Reconnecting the EFI Driver. Please wait...
> FS0:\> sas3flash.efi -o -f SAS9305_24i_IT_P.bin -b mpt3x64.rom
Avago Technologies SAS3 Flash Utility
Version 15.00.00.00 (2016.11.17)
Copyright 2008-2016 Avago Technologies. All rights reserved.

SAS3FLASH: Disconnecting the EFI Driver.
Advanced Mode Set
Adapter Selected is a Avago SAS: SAS3224(A1)

Executing Operation: Flash Firmware Image

    Firmware Image has a Valid Checksum.
    Firmware Version 16.00.12.00
    Firmware Image compatible with Controller.

    Valid NVDATA Image found.
    NVDATA Major Version 10.00
    Checking for a compatible NVData image...

    NVDATA Device ID and Chip Revision match verified.
    NVDATA Versions Compatible.
    Valid Initialization Image verified.
    Valid BootLoader Image verified.

    Beginning Firmware Download...
    Firmware Download Successful.

    Verifiying Download...

    Firmware Flash Successful.

    Resetting Adapter...
    Adapter Successfully Reset.

    NVDATA Version 10.00.00.03

Executing Operation: Flash BIOS Image

    Validating BIOS Image...

    BIOS Header Signature is Valid

    BIOS Image has a Valid Checksum.

    BIOS PCI Structure Signature Valid.

    BIOS Image Compatible with the SAS Controller.

    Attempting to Flash BIOS Image...

    Verifying Download...

    Flash BIOS Image Successful.

Finished Processing Commands Successfully.
Exiting SAS3Flash.

SAS3FLASH: Reconnecting the EFI Driver. Please wait...

And back to Ubuntu:

> FS0:\> fs1:\EFI\ubuntu\grubx64.efi

What does lspci say now?

LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+

I can’t believe that this card supposedly doesn’t support ASPM at all. By the way, it’s not specific to my particular card. I should know, I bought a second one for troubleshooting.

Photo showing two identical 9305-24i cards

Also, what is going on with LnkCap?

I found this message on the OmniOS mailing list: [OmniOS-discuss] SAS 9305-16e HBA support in Illumos

This person’s HBA is very very similar to mine and their LnkCap shows ASPM not supported, Exit Latency L0s <2us, L1 <4us. That… doesn’t make a lot of sense.

Nvidia’s official documentation shows a Mellanox adapter with ASPM not supported, Exit Latency L0s unlimited, L1 unlimited and honestly that might even make sense. An unlimited latency for waking up out of a power saving state implies that the device simply won’t wake up once sent into that state, so it shouldn’t be used. But that HBA is just very confusing.
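
To make sense of those LnkCap strings, it helps to look at how the register is laid out. Here’s a small Python sketch, assuming the standard PCIe Link Capabilities layout (ASPM support in bits 11:10, L0s exit latency in bits 14:12, L1 exit latency in bits 17:15) and the label strings lspci uses for each encoding. The function name and the two example register values are made up; they are not read from any real card.

```python
# Labels in the order lspci prints them for each encoding.
ASPM_SUPPORT = ["not supported", "L0s", "L1", "L0s L1"]
L0S_EXIT = ["<64ns", "<128ns", "<256ns", "<512ns", "<1us", "<2us", "<4us", "unlimited"]
L1_EXIT = ["<1us", "<2us", "<4us", "<8us", "<16us", "<32us", "<64us", "unlimited"]


def decode_lnkcap(reg: int) -> dict:
    """Decode the ASPM-related fields of a 32-bit Link Capabilities value."""
    return {
        "aspm": ASPM_SUPPORT[(reg >> 10) & 0x3],      # bits 11:10
        "l0s_exit": L0S_EXIT[(reg >> 12) & 0x7],      # bits 14:12
        "l1_exit": L1_EXIT[(reg >> 15) & 0x7],        # bits 17:15
    }


# Hypothetical value: support field 0, L0s encoding 5, L1 encoding 2.
like_the_hba = (5 << 12) | (2 << 15)
# Hypothetical value: support field 0, both latency fields at their maximum.
like_the_mellanox = (7 << 12) | (7 << 15)

print(decode_lnkcap(like_the_hba))
print(decode_lnkcap(like_the_mellanox))
```

The three fields are independent, so a device can report exit latencies while the support field still reads 0. As far as I can tell, “unlimited” is simply lspci’s label for the highest encoding (111b), which the spec defines as more than 4 µs for L0s and more than 64 µs for L1.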

You know what, let’s just ignore whatever the device says and see what happens.

With aspm.py I can indeed get it into L1!

root@user-Z590-D:~$ python3 aspm.py
00:1b.4 PCI bridge: Intel Corporation Device 43c4 (rev 11)
0x34 points to 0x40
Value at 0x40 is 0x10
Found the byte at: 0x40
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x50
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x43
-> ASPM_L1_AND_L0s
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
0x34 points to 0x50
Value at 0x50 is 0x1
Value is not 0x10!
Reading the next byte...
0x51 points to 0x68
Value at 0x68 is 0x10
Found the byte at: 0x68
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x78
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x43
-> ASPM_L1_AND_L0s
root@user-Z590-D:~$ lspci -vvv -s 02:00.0
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
        Subsystem: Broadcom / LSI SAS9305-24i
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at 5000 [size=256]
        Region 1: Memory at 53100000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 53000000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1e0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [1c0 v1] Power Budgeting <?>
        Capabilities: [190 v1] Dynamic Power Allocation <?>
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas
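
The register walk that the script prints above can be sketched in a few lines of Python. This is not the aspm.py source, just the same idea applied to an in-memory copy of a device’s config space: follow the capability list starting at 0x34, stop at capability ID 0x10 (PCI Express), add 0x10 to reach Link Control, and rewrite its two lowest bits. The function names are hypothetical, ASPM_L0s_ONLY is my guess at the one mode name that doesn’t appear in the output, and the demo bytes merely mimic the values seen for the HBA.

```python
PCI_CAPABILITY_POINTER = 0x34   # config space offset of the first capability pointer
PCI_CAP_ID_EXPRESS = 0x10       # capability ID of the PCI Express capability
LINK_CONTROL_OFFSET = 0x10      # Link Control register, relative to that capability
ASPM_MASK = 0b11                # ASPM control lives in the two lowest bits

# Names as printed by the script; "ASPM_L0s_ONLY" is an assumption.
ASPM_NAMES = {0: "ASPM_DISABLED", 1: "ASPM_L0s_ONLY", 2: "ASPM_L1_ONLY", 3: "ASPM_L1_AND_L0s"}


def find_pcie_capability(config: bytes) -> int:
    """Walk the capability list and return the offset of the PCIe capability."""
    pointer = config[PCI_CAPABILITY_POINTER] & 0xFC
    seen = set()
    while pointer and pointer not in seen:   # guard against loops in broken lists
        seen.add(pointer)
        if config[pointer] == PCI_CAP_ID_EXPRESS:
            return pointer
        pointer = config[pointer + 1] & 0xFC  # next-capability pointer
    raise LookupError("no PCI Express capability found")


def set_aspm(config: bytearray, mode: int) -> int:
    """Set the ASPM bits in Link Control; return the offset that was patched."""
    link_control = find_pcie_capability(config) + LINK_CONTROL_OFFSET
    config[link_control] = (config[link_control] & ~ASPM_MASK & 0xFF) | (mode & ASPM_MASK)
    return link_control


# Fake config space mimicking the HBA: 0x34 -> 0x50 (ID 0x01) -> 0x68 (ID 0x10).
config = bytearray(256)
config[0x34] = 0x50
config[0x50], config[0x51] = 0x01, 0x68
config[0x68] = 0x10
config[0x78] = 0x40

offset = set_aspm(config, 3)
print(hex(offset), hex(config[offset]), ASPM_NAMES[config[offset] & ASPM_MASK])
```

On a real system the bytes would have to be read from and written back to the device (for example through its config file under /sys/bus/pci/devices/ as root), and the same change is needed on both ends of the link, which is why the output above shows the bridge at 00:1b.4 getting patched along with the HBA.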

We run powertop --auto-tune again and what does the power meter say?

30W with the CPU chilling at C8! If I plug in the 10G X710-DA2 as well we end up at 34W. Those are some amazing numbers! :)

The moment of truth

This is where I want to put everything I’ve learned together. No more Ubuntu, it’s back to TrueNAS now. Alright, I prepped my USB drive. It looks like this:

EFI
    BOOT
        BOOTX64.efi (a copy of OpenShell)
        startup.nsh
ASPMEnabler.efi

The startup.nsh file looks like this:

echo -off
echo Starting ASPMEnabler...
fs0:\ASPMEnabler.efi
echo Booting into GRUB...
fs1:\EFI\debian\grubx64.efi

Booting from said USB drive gets us this:

Press ESC in 1 seconds to skip startup.nsh or any other key to continue...
Shell> echo -off
Starting ASPMEnabler...

ASPMEnabler
https://github.com/0x666690/ASPM
A modified version of: https://github.com/Jamesits/S0ixEnabler
Firmware American Megatrends Rev 327699

Table #1/14: Not RDSP
Table #2/14: Not RDSP
Table #3/14: Not RDSP
Table #4/14: Not RDSP
Table #5/14: Not RDSP
Table #6/14: Not RDSP
Table #7/14: RDSP Rev 0 @0x3B513000 | No XSDT
Table #8/14: RDSP Rev 2 @0x3B513014 | XSDT OEM ID: ALASKA Tables: 26
ACPI table #1/26: FACP Rev 6 OEM ID: ALASKA
Checking initial checksum... OK
Patching FADT table...
FADT::IaPcBootArch before: 0x11
FADT::IaPcBootArch after: 0x1
Checksum before: 0x2
Checksum after: 0x3F
Re-check... OK
FADT table patch finished
ASPMEnabler done

Booting into GRUB...
Welcome to GRUB!
root@truenas[/home/admin]# dmesg | grep ASPM
[    1.358832] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]

It works! :)

All of this is happening on a fresh TrueNAS install on my Samsung PM981 NVMe SSD. I did also have my 480GB ADATA SSD plugged in. Turns out that when you run /sbin/powertop --auto-tune with it plugged in, a small message on the TrueNAS console flashes by, telling us that things are not going well:

[  131.844759] ahci 0000:00:17.0: port does not support device sleep

The CPU package then permanently gets itself stuck in C2, when previously it would even go down to C3. Let’s get to the bottom of this message: what’s 00:17.0?

root@truenas[/home/admin]# lspci -s 00:17.0 -vvv
00:17.0 SATA controller: Intel Corporation Device 43d2 (rev 11) (prog-if 01 [AHCI 1.0])
        DeviceName: Onboard - SATA
        Subsystem: Gigabyte Technology Co., Ltd Device b005
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 127
        Region 0: Memory at 53514000 (32-bit, non-prefetchable) [size=8K]
        Region 1: Memory at 53518000 (32-bit, non-prefetchable) [size=256]
        Region 2: I/O ports at 6090 [size=8]
        Region 3: I/O ports at 6080 [size=4]
        Region 4: I/O ports at 6060 [size=32]
        Region 5: Memory at 53517000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee08004  Data: 0022
        Capabilities: [70] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
        Kernel driver in use: ahci
        Kernel modules: ahci
root@truenas[/home/admin]# ls -al /sys/block/sd*
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sda -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sda
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdb -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdc -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:2/end_device-0:2/target0:0:2/0:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdd -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:3/end_device-0:3/target0:0:3/0:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sde -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:4/end_device-0:4/target0:0:4/0:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdf -> ../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:5/end_device-0:5/target0:0:5/0:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdg -> ../devices/pci0000:00/0000:00:17.0/ata6/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx 1 root root 0 Oct 29 01:14 /sys/block/sdh -> ../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host8/target8:0:0/8:0:0:0/block/sdh

Ah, so the SSD is at fault: sdg is the only drive hanging off 00:17.0, while everything else sits behind the HBA at 02:00.0 or on USB. The “device sleep” in that message refers to DEVSLP. Funnily enough, the datasheet for this particular SSD explicitly lists it as a feature.
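
Matching drives to controllers by eyeballing symlinks gets old quickly, so here’s a tiny Python sketch that does it. It assumes the sysfs path format shown in the listing above, where the last PCI address in the path is the controller the disk hangs off; the function name is made up, and the two paths are copied from that listing.

```python
import re

# A full PCI address as it appears in sysfs paths: domain:bus:device.function
PCI_ADDRESS = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_target: str) -> str:
    """Return the PCI address closest to the disk in a /sys/block symlink target."""
    matches = PCI_ADDRESS.findall(sysfs_target)
    if not matches:
        raise ValueError("no PCI address in path")
    return matches[-1]


links = {
    "sda": "../devices/pci0000:00/0000:00:1b.4/0000:02:00.0/host0/port-0:1/"
           "end_device-0:1/target0:0:1/0:0:1:0/block/sda",
    "sdg": "../devices/pci0000:00/0000:00:17.0/ata6/host6/target6:0:0/6:0:0:0/block/sdg",
}

for disk, target in links.items():
    print(disk, "->", controller_of(target))

# Which disks sit on the onboard SATA controller?
on_sata = [disk for disk, target in links.items() if controller_of(target) == "0000:00:17.0"]
print(on_sata)
```

On a live box the targets could be collected with os.readlink on each entry in /sys/block instead of being pasted in by hand.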

For once lspci can’t help us; we need hdparm instead.

root@truenas[/home/admin]# /sbin/hdparm -I /dev/sdg

/dev/sdg:

ATA device, with non-removable media
        Model Number:       ADATA SP550
        Serial Number:      2G2220072146
        Firmware Revision:  P0330AA
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
        Supported: 9 8 7 6 5
        Likely used: 9
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:    16514064
        LBA    user addressable sectors:   268435455
        LBA48  user addressable sectors:   937703088
        Logical  Sector size:                   512 bytes
        Physical Sector size:                  4096 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:      457862 MBytes
        device size with M = 1000*1000:      480103 MBytes (480 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: Solid State Device
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 2   Current = 1
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Host-initiated interface power management
           *    Phy event counters
           *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
                Device Sleep (DEVSLP)
           *    SMART Command Transport (SCT) feature set
           *    SCT Write Same (AC2)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
           *    SANITIZE feature set
           *    BLOCK_ERASE_EXT command
           *    DOWNLOAD MICROCODE DMA command
           *    WRITE BUFFER DMA command
           *    READ BUFFER DMA command
           *    Data Set Management TRIM supported (limit 8 blocks)
           *    Deterministic read ZEROs after TRIM
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
                supported: enhanced erase
        2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Device Sleep:
        DEVSLP Exit Timeout (DETO): 220 ms (drive)
        Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct

So it is supported but not enabled, and I don’t see a way to forcibly enable it, at least not from within the OS. It needs to be done in the BIOS.

Settings → IO Ports → SATA And RST Configuration:
    SATA 3          ADATA SP550 (480.1GB)
        Software Preserve       SUPPORTED
        Port 3                  Enabled
        SATA Port 3 DevSlp       Disabled → Enabled
        Hot Plug                Disabled

With this option enabled, powertop no longer throws that particular error and the CPU no longer gets stuck at C2.
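To double-check the result after a reboot, the asterisk column of hdparm -I can be parsed instead of eyeballed. A minimal sketch, assuming the usual hdparm layout where a leading `*` means "enabled"; the function name and the sample text are made up for illustration:

```python
from typing import Optional

def feature_enabled(hdparm_output: str, feature: str) -> Optional[bool]:
    """True if the feature line carries a '*', False if it is listed
    without one, None if the feature is not listed at all."""
    for line in hdparm_output.splitlines():
        stripped = line.strip()
        name = stripped.lstrip("*").strip()
        if name == feature:
            return stripped.startswith("*")
    return None

# Excerpt in the same shape as the hdparm -I output above (before the BIOS change).
SAMPLE = """\
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
                Device Sleep (DEVSLP)
"""

devslp = feature_enabled(SAMPLE, "Device Sleep (DEVSLP)")
print("DEVSLP enabled:", devslp)
```

In practice the text would come from running hdparm -I against the SSD rather than from a hard-coded string.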

Now that that is out of the way, let’s get to the real star of the show, the 9305-24i.

root@truenas[/home/admin]# python3 aspm.py
00:1b.4 PCI bridge: Intel Corporation Device 43c4 (rev 11)
0x34 points to 0x40
Value at 0x40 is 0x10
Found the byte at: 0x40
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x50
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x42
-> ASPM_L1_ONLY
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
0x34 points to 0x50
Value at 0x50 is 0x1
Value is not 0x10!
Reading the next byte...
0x51 points to 0x68
Value at 0x68 is 0x10
Found the byte at: 0x68
Adding 0x10 to the register...
Final register reads: 0x40
Byte to patch: 0x78
Byte is set to 0x40
-> ASPM_DISABLED
Value doesn't match the one we want, setting it!
Byte is set to 0x42
-> ASPM_L1_ONLY
root@truenas[/home/admin]# lspci -vvv -s 02:00.0
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3 (rev 01)
        Subsystem: Broadcom / LSI SAS9305-24i
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at 5000 [size=256]
        Region 1: Memory at 53100000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 53000000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [1e0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [1c0 v1] Power Budgeting <?>
        Capabilities: [190 v1] Dynamic Power Allocation <?>
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas
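The walk that aspm.py prints above (follow the pointer at 0x34, hop through the capability list until ID 0x10 shows up, then patch the byte 0x10 further in) can be sketched roughly like this. It is not the script itself, just a minimal reimplementation that works on an in-memory copy of the config space. The fake register contents mirror the HBA output above; the next-pointer 0xa8 and every name except ASPM_DISABLED and ASPM_L1_ONLY are assumptions.

```python
PCI_CAPABILITY_POINTER = 0x34  # header byte holding the offset of the first capability
PCI_CAP_ID_EXPRESS = 0x10      # capability ID of the PCI Express capability
LINK_CONTROL_OFFSET = 0x10     # Link Control sits 0x10 bytes into that capability
ASPM_MASK = 0b11               # bits 1:0 of Link Control select the ASPM mode
ASPM_MODES = {                 # names for values 1 and 3 are assumed
    0: "ASPM_DISABLED",
    1: "ASPM_L0s_ONLY",
    2: "ASPM_L1_ONLY",
    3: "ASPM_L1_AND_L0s",
}

def find_link_control(cfg: bytearray) -> int:
    """Walk the capability list and return the offset of the Link Control byte."""
    ptr = cfg[PCI_CAPABILITY_POINTER] & 0xFC
    seen = set()
    while ptr and ptr not in seen:      # stop on end of list or on a loop
        seen.add(ptr)
        if cfg[ptr] == PCI_CAP_ID_EXPRESS:
            return ptr + LINK_CONTROL_OFFSET
        ptr = cfg[ptr + 1] & 0xFC       # byte after the ID points to the next capability
    raise LookupError("no PCI Express capability found")

def set_aspm(cfg: bytearray, mode: int) -> int:
    """Replace the two ASPM bits, leave the rest of the byte untouched."""
    offset = find_link_control(cfg)
    cfg[offset] = (cfg[offset] & ~ASPM_MASK & 0xFF) | mode
    return offset

# Fake config space shaped like the 9305-24i output above.
cfg = bytearray(256)
cfg[0x34] = 0x50
cfg[0x50], cfg[0x51] = 0x01, 0x68   # Power Management capability, next at 0x68
cfg[0x68], cfg[0x69] = 0x10, 0xA8   # PCI Express capability (next pointer assumed)
cfg[0x78] = 0x40                    # Link Control low byte, ASPM disabled

offset = set_aspm(cfg, 2)
print(hex(offset), hex(cfg[offset]), ASPM_MODES[cfg[offset] & ASPM_MASK])
```

On real hardware the bytes would have to be read and written through something like setpci or the device's config file in sysfs, as root, and that part is deliberately left out here.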

Moment of truth, what does powertop --auto-tune do?

Success!

68W with all disks spinning.

If we now enable HDD standby (without spindown, mind you), this goes down to 55W. Seeing how at the start of this whole ordeal the server was pulling 58W with the disks spun down and not just the heads parked, I am pretty damn happy. With that said, I wouldn’t personally let the drives park their heads this often.

Let the troubleshooting begin

From this point onwards, things did not go smoothly. Instead of presenting you with pages upon pages of dmesg output I will try to keep things short.

For the purposes of stress testing I just ran ZFS scrubs the entire time.

Some notes:

  1. L0s is not it. Setting it right after boot-up will cause the system to hang and reboot. If set without pci=nommconf (see below), the system will spam PCIe bus errors. Also, with L0s the system will not drop into anything below C3.

  2. L1 works, I think, but it throws the same PCIe bus errors, albeit much less frequently:

[  844.281175] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:02:00.0
[  844.281206] mpt3sas 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  844.281210] mpt3sas 0000:02:00.0:   device [1000:00c4] error status/mask=00001000/00002000
[  844.281213] mpt3sas 0000:02:00.0:    [12] Timeout
[  849.878827] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:02:00.0
[  849.878852] mpt3sas 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[  849.878857] mpt3sas 0000:02:00.0:   device [1000:00c4] error status/mask=00001000/00002000
[  849.878862] mpt3sas 0000:02:00.0:    [12] Timeout
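The status/mask pair in those lines is a bitfield from the AER Correctable Error Status and Mask registers. A small decoder, as a sketch: the bit positions are the ones I know from the PCIe spec (they also line up with the CESta/CEMsk line in the lspci output above), while the function name is made up.

```python
# Bit positions in the AER Correctable Error Status / Mask registers.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode_correctable(value: int) -> list:
    """Return the names of all known bits set in a status or mask value."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_ERROR_BITS.items())
        if value & (1 << bit)
    ]

# Values taken from "error status/mask=00001000/00002000" above.
status, mask = 0x00001000, 0x00002000
print("status:", decode_correctable(status))
print("mask:  ", decode_correctable(mask))
```

So the reported error is the replay timer expiring at the data link layer, which matches the "[12] Timeout" line, and the only masked bit is the advisory non-fatal one.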

With that said, adding pci=nommconf to the boot arguments in GRUB appears to have made these error messages disappear. I’ll be the first one to admit that I don’t exactly understand the consequences of setting it, but from reading this and this I feel significantly better about pci=nommconf than I do about pci=noaer, which was another suggestion I found online. That one just hides all the errors lol

We can add it to the TrueNAS boot arguments like this:

midclt call system.advanced.update '{"kernel_extra_options": "pci=nommconf"}'

  3. You’re gonna wanna enable logging when first tinkering with this. Don’t go in blind like I initially did.

The mpt3sas driver doesn’t exactly provide a lot of information in its default configuration. Luckily there’s a parameter for it. If you wanna know about all the other parameters, just run /sbin/modinfo mpt3sas. For a description of what each parameter does, it might be worth looking at the official user documentation for the driver.

A reasonable option for initial testing would be

echo 0x3f8 > /sys/module/mpt3sas/parameters/logging_level

All of the pre-defined values for the logging_level can be found at the top of mptdebug.h. The value 0x3f8 is what one of the Broadcom engineers in that mpt2sas bug thread wanted output from, so I figured it would be a good choice for me as well.

It essentially turns this:

[58307.970617] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970660] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970684] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970741] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970889] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)

into this:

[58307.796312] mpt3sas_cm0: Device Status Change
[58307.796337] mpt3sas_cm0: Enable tm_busy flag for handle(0x001c)
[58307.796422] mpt3sas_cm0: device status change: (internal device reset)       handle(0x001c), sas address(0x300062b202991698), tag(65535)
[58307.803237] mpt3sas_cm0: Device Status Change
[58307.803261] mpt3sas_cm0: Enable tm_busy flag for handle(0x001c)
[58307.803344] mpt3sas_cm0: device status change: (internal device reset)       handle(0x001c), sas address(0x300062b202991698), tag(65535)
[58307.970553] mpt3sas_cm0: Device Status Change
[58307.970555] sd 0:0:3:0: [sdd] tag#1889 CDB: Read(16) 88 00 00 00 00 02 a6 d4 f7 98 00 00 00 40 00 00
[58307.970555] sd 0:0:3:0: [sdd] tag#1885 CDB: Read(16) 88 00 00 00 00 02 a6 d4 e7 98 00 00 04 98 00 00
[58307.970559] sd 0:0:3:0: [sdd] tag#1888 CDB: Read(16) 88 00 00 00 00 02 a6 d4 f4 08 00 00 03 90 00 00
[58307.970566] mpt3sas_cm0:     sas_address(0x300062b202991698), phy(5)
[58307.970567] mpt3sas_cm0:     sas_address(0x300062b202991698), phy(5)
[58307.970568] mpt3sas_cm0:     sas_address(0x300062b202991698), phy(5)
[58307.970575] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(6)
[58307.970576] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(6)
[58307.970577] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(6)
[58307.970581] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[58307.970584] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[58307.970587] mpt3sas_cm0:     handle(0x001c), ioc_status(unknown)(0x0003), smid(1886)
[58307.970589] mpt3sas_cm0: Disable tm_busy flag for handle(0x001c)
[58307.970593] mpt3sas_cm0:     request_len(602112), underflow(602112), resid(73720)
[58307.970599] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[58307.970599] mpt3sas_cm0:     tag(2), transfer_count(528392), sc->result(0x000b0000)
[58307.970605] mpt3sas_cm0: Device Status Change
[58307.970605] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[58307.970609] mpt3sas_cm0: Disable tm_busy flag for handle(0x001c)
[58307.970614] mpt3sas_cm0:     handle(0x001c), ioc_status(scsi ioc terminated)(0x004b), smid(1889)
[58307.970617] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970621] mpt3sas_cm0:     request_len(466944), underflow(466944), resid(466944)
[58307.970625] mpt3sas_cm0:     handle(0x001c), ioc_status(scsi ioc terminated)(0x004b), smid(1890)
[58307.970628] mpt3sas_cm0:     tag(65535), transfer_count(0), sc->result(0x000b0000)
[58307.970635] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[58307.970640] sd 0:0:3:0: [sdd] tag#1886 CDB: Read(16) 88 00 00 00 00 02 a6 d4 ec 30 00 00 03 68 00 00
[58307.970645] mpt3sas_cm0:     sas_address(0x300062b202991698), phy(5)
[58307.970646] mpt3sas_cm0:     request_len(32768), underflow(32768), resid(32768)
[58307.970652] mpt3sas_cm0:     tag(65535), transfer_count(0), sc->result(0x000b0000)
[58307.970660] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970663] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(6)
[58307.970667] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[58307.970671] mpt3sas_cm0:     handle(0x001c), ioc_status(scsi ioc terminated)(0x004b), smid(1887)
[58307.970673] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[58307.970684] sd 0:0:3:0: [sdd] tag#1887 CDB: Read(16) 88 00 00 00 00 02 a6 d4 ef 98 00 00 04 70 00 00
[58307.970684] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970689] mpt3sas_cm0:     sas_address(0x300062b202991698), phy(5)
[58307.970692] mpt3sas_cm0:     request_len(446464), underflow(446464), resid(446464)
[58307.970694] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(6)
[58307.970697] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[58307.970702] mpt3sas_cm0:     handle(0x001c), ioc_status(scsi ioc terminated)(0x004b), smid(1888)
[58307.970708] mpt3sas_cm0:     tag(3), transfer_count(0), sc->result(0x000b0000)
[58307.970717] mpt3sas_cm0:     request_len(581632), underflow(581632), resid(581632)
[58307.970723] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[58307.970731] mpt3sas_cm0:     tag(65535), transfer_count(0), sc->result(0x000b0000)
[58307.970741] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970747] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[58307.970889] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[58307.970969] mpt3sas_cm0: device status change: (internal device reset complete)      handle(0x001c), sas address(0x300062b202991698), tag(65535)
[58307.971006] mpt3sas_cm0: device status change: (internal device reset complete)      handle(0x001c), sas address(0x300062b202991698), tag(65535)

I did later on increase the verbosity some more and settled on 0x23f8.

Side note: no matter how you combine these values, don’t enable MPT_DEBUG_SCSI. It prints every read or write command to the console…
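Since logging_level is just a bitmask, it helps to see which flags a given value actually switches on. A rough sketch; the flag names and values below are how I remember them from the driver's debug header, so treat them as assumptions and check them against your kernel source before relying on them:

```python
# Debug flags as recalled from the mpt3sas debug header (assumed, verify locally).
MPT_DEBUG_FLAGS = {
    0x00000001: "MPT_DEBUG",
    0x00000002: "MPT_DEBUG_MSG_FRAME",
    0x00000004: "MPT_DEBUG_SG",
    0x00000008: "MPT_DEBUG_EVENTS",
    0x00000010: "MPT_DEBUG_EVENT_WORK_TASK",
    0x00000020: "MPT_DEBUG_INIT",
    0x00000040: "MPT_DEBUG_EXIT",
    0x00000080: "MPT_DEBUG_FAIL",
    0x00000100: "MPT_DEBUG_TM",
    0x00000200: "MPT_DEBUG_REPLY",
    0x00000400: "MPT_DEBUG_HANDSHAKE",
    0x00000800: "MPT_DEBUG_CONFIG",
    0x00001000: "MPT_DEBUG_DL",
    0x00002000: "MPT_DEBUG_RESET",
    0x00004000: "MPT_DEBUG_SCSI",
    0x00008000: "MPT_DEBUG_IOCTL",
}

def decode_logging_level(level: int) -> list:
    """List the names of all known flags contained in a logging_level value."""
    return [name for bit, name in sorted(MPT_DEBUG_FLAGS.items()) if level & bit]

for level in (0x3F8, 0x23F8):
    print(hex(level), decode_logging_level(level))
```

With that table, 0x23f8 is 0x3f8 plus the reset flag, and neither value includes MPT_DEBUG_SCSI.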

Another thing about parameters: do not turn on mpt3sas_fwfault_debug unless you have the hardware to pull data from the UART on the HBA. If the device ever gets to mpt3sas_base_hard_reset_handler it will enter mpt3sas_halt_firmware, give you a stack trace and then leave you wondering why your system panicked. Not to mention that this stack trace doesn’t exactly contain a lot of information to begin with:

[ 3179.805528] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: enter
[ 3179.806662] CPU: 5 PID: 19596 Comm: kworker/u16:0 Tainted: P           OE      6.1.55-debug+truenas #2
[ 3179.807792] Hardware name: Gigabyte Technology Co., Ltd. Z590 D/Z590 D, BIOS F7a 01/24/2022
[ 3179.808933] Workqueue: poll_mpt3sas0_statu _base_fault_reset_work [mpt3sas]
[ 3179.810093] Call Trace:
[ 3179.811225]  <TASK>
[ 3179.812357]  dump_stack_lvl+0x44/0x5c
[ 3179.813507]  mpt3sas_halt_firmware.part.0+0xf/0xb3 [mpt3sas]
[ 3179.814658]  mpt3sas_base_hard_reset_handler.cold+0x221/0x221 [mpt3sas]
[ 3179.815849]  _base_fault_reset_work+0x292/0x2a0 [mpt3sas]
[ 3179.817008]  process_one_work+0x1c4/0x380
[ 3179.818159]  worker_thread+0x4d/0x380
[ 3179.819304]  ? _raw_spin_lock_irqsave+0x23/0x50
[ 3179.820457]  ? rescuer_thread+0x3a0/0x3a0
[ 3179.821620]  kthread+0xe6/0x110
[ 3179.822763]  ? kthread_complete_and_exit+0x20/0x20
[ 3179.823911]  ret_from_fork+0x1f/0x30
[ 3179.825062]  </TASK>
[ 3179.826252] mpt3sas_cm0 fault info from func: mpt3sas_halt_firmware

  4. Use of powertop --auto-tune is strongly discouraged.

Using it will automatically enable all of the tunables, including those not needed to get past C2/C3, which will just give you headaches when debugging. It will also mess with the power management of each of your drives, which leads to faults, which in turn lead to the server essentially hanging itself.

Here are all the tunables I ended up setting by hand:

# Runtime PM for PCI Device Intel Corporation Ethernet Controller X710 for 10GbE SFP+
echo 'auto' > '/sys/bus/pci/devices/0000:05:00.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:05:00.1/power/control';

# Runtime PM for PCI Device Intel Corporation 10th Gen Core Processor Host Bridge/DRAM Registers
echo 'auto' > '/sys/bus/pci/devices/0000:00:00.0/power/control';

# Runtime PM for PCI Device Intel Corporation Tiger Lake-H SPI Controller
echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.5/power/control';

# Runtime PM for PCI Device Broadcom / LSI SAS3224 PCI-Express Fusion-MPT SAS-3
#echo 'auto' > '/sys/bus/pci/devices/0000:02:00.0/power/control';

# Runtime PM for PCI Device Intel Corporation Tiger Lake-H PCI Express Root Port #9
echo 'auto' > '/sys/bus/pci/devices/0000:00:1d.0/power/control';

# Runtime PM for PCI Device Intel Corporation Tiger Lake-H Shared SRAM
echo 'auto' > '/sys/bus/pci/devices/0000:00:14.2/power/control';

# Runtime PM for I2C Adapter i2c-1 (i915 gmbus dpa)
echo 'auto' > '/sys/bus/i2c/devices/i2c-1/device/power/control';

# Autosuspend for USB device ITE Device [ITE Tech. Inc.]
echo 'auto' > '/sys/bus/usb/devices/1-13/power/control';

# Autosuspend for USB device USB DISK 2.0 [        ]
echo 'auto' > '/sys/bus/usb/devices/1-3/power/control';

# Runtime PM for PCI Device Intel Corporation Device 4385
echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.0/power/control';

# Runtime PM for PCI Device of the built-in SATA controller
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/power/control';

# Runtime PM for the built-in SATA controller
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata1/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata2/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata3/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata4/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata5/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:17.0/ata6/power/control';

# NMI watchdog should be turned off
echo '0' > '/proc/sys/kernel/nmi_watchdog';

# VM writeback timeout
echo '1500' > '/proc/sys/vm/dirty_writeback_centisecs';

# Enable SATA link power management for the built-in sata controller
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host1/link_power_management_policy';
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host2/link_power_management_policy';
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host3/link_power_management_policy';
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host4/link_power_management_policy';
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host5/link_power_management_policy';
echo 'med_power_with_dipm' > '/sys/class/scsi_host/host6/link_power_management_policy';
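A list like the one above is easy to get wrong after a hardware change, since PCI addresses and host numbers shift around. Here is a sketch of applying such tunables from a table instead, with a dry-run switch and a check that each path exists. The function, its parameters and the shortened table are made up for illustration; the demo writes into a temporary directory rather than the real /proc and /sys.

```python
import tempfile
from pathlib import Path

# A few entries from the list above, as paths relative to the filesystem root.
TUNABLES = {
    "proc/sys/kernel/nmi_watchdog": "0",
    "proc/sys/vm/dirty_writeback_centisecs": "1500",
    "sys/bus/pci/devices/0000:05:00.0/power/control": "auto",
}

def apply_tunables(tunables: dict, root: str = "/", dry_run: bool = True) -> dict:
    """Write each value to its file below root and report what happened per path."""
    results = {}
    for rel_path, value in tunables.items():
        path = Path(root) / rel_path
        if not path.exists():
            results[rel_path] = "missing"      # device moved or not present
        elif dry_run:
            results[rel_path] = "would write " + value
        else:
            path.write_text(value)
            results[rel_path] = "written"
    return results

# Demo against a throwaway directory that only contains the NMI watchdog file.
with tempfile.TemporaryDirectory() as fake_root:
    target = Path(fake_root) / "proc/sys/kernel/nmi_watchdog"
    target.parent.mkdir(parents=True)
    target.write_text("1")
    demo = apply_tunables(TUNABLES, root=fake_root, dry_run=False)
    demo_value = target.read_text()

for rel_path, outcome in demo.items():
    print(rel_path, "->", outcome)
```

Run for real it would need root and root="/", and the HBA's own power/control entry would stay out of the table, just like it is commented out above.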
  5. Do not enable APM. Explicitly disable any kind of power management for the drives.

  6. The system will regularly query information about the state of the drives. In dmesg that looks like this:

[32461.263672] sd 0:0:4:0: [sde] tag#5697 CDB: ATA command pass through(16) 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00
[32461.264991] mpt3sas_cm0:     sas_address(0x300062b202991699), phy(6)
[32461.266196] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(4)
[32461.267398] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[32461.268593] mpt3sas_cm0:     handle(0x001d), ioc_status(success)(0x0000), smid(5698)
[32461.269796] mpt3sas_cm0:     request_len(0), underflow(0), resid(0)
[32461.270994] mpt3sas_cm0:     tag(0), transfer_count(0), sc->result(0x00000002)
[32461.272192] mpt3sas_cm0:     scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[32461.273400] mpt3sas_cm0:     [sense_key,asc,ascq]: [0x01,0x00,0x1d], count(22)

Those messages can safely be ignored.
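For anyone wondering what that CDB actually asks for, it can be picked apart by hand. A sketch, assuming the ATA PASS-THROUGH(16) byte layout from the SAT spec as I understand it; function names are made up:

```python
def decode_ata_passthrough16(cdb_hex: str) -> dict:
    """Pull the interesting fields out of an ATA PASS-THROUGH(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,  # 3 means non-data
        "ck_cond": bool(cdb[2] & 0x20),    # ask for ATA registers back via sense data
        "features": cdb[4],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "command": cdb[14],
    }

def is_smart_return_status(fields: dict) -> bool:
    """SMART (0xB0) with feature 0xDA and the 0x4F/0xC2 signature in the LBA."""
    return (
        fields["command"] == 0xB0
        and fields["features"] == 0xDA
        and fields["lba_mid"] == 0x4F
        and fields["lba_high"] == 0xC2
    )

# CDB copied from the dmesg line above.
fields = decode_ata_passthrough16("85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00")
print(fields, is_smart_return_status(fields))
```

Read that way, the command is a SMART RETURN STATUS with CK_COND set, so the "check condition" with sense [0x01,0x00,0x1d] (recovered error, ATA pass-through information available) is just the drive handing its registers back, not a failure.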

  7. It appears that setting mpt3sas.msix_disable=1 helps with the system’s stability. I don’t exactly understand why, but in terms of functionality we’re not losing anything, so it’s whatever.

  8. With all the correct tunables set, the system will go down as far as C10. I ended up disabling C10 because it led to fault states that the system was unable to recover from. Setting the lowest C-state to C6/7 did not change anything.

  9. I’ve run into multiple fault states, namely fault_state(0x2623) and fault_state(0x5854), easily recognized by the red text in dmesg. Both of these lead to the controller resetting itself. It can take up to ten seconds for the storage pool attached to it to return to an operational state. My ZFS array appeared to simply not care about any of that happening. No errors, no warnings, nada.

  10. Messages about devices being reset and devices changing state are relatively common. These do not appear to mess with the rest of the system.

A device reset looks a little like this:

[ 2127.945563] mpt3sas_cm0: Device Status Change
[ 2127.947874] mpt3sas_cm0: Enable tm_busy flag for handle(0x0019)
[ 2127.949020] mpt3sas_cm0: device status change: (internal device reset)       handle(0x0019), sas address(0x300062b202991695), tag(65535)
[ 2127.951658] mpt3sas_cm0: Device Status Change
[ 2127.952804] mpt3sas_cm0: Enable tm_busy flag for handle(0x0019)
[ 2127.953938] mpt3sas_cm0: device status change: (internal device reset)       handle(0x0019), sas address(0x300062b202991695), tag(65535)
[ 2128.109209] sd 0:0:0:0: [sda] tag#2303 CDB: Read(16) 88 00 00 00 00 02 de 5b e0 88 00 00 02 08 00 00
[ 2128.112871] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.116758] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.120787] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.124813] mpt3sas_cm0:     handle(0x0019), ioc_status(scsi ioc terminated)(0x004b), smid(2304)
[ 2128.128846] mpt3sas_cm0:     request_len(266240), underflow(266240), resid(266240)
[ 2128.132875] mpt3sas_cm0:     tag(1), transfer_count(0), sc->result(0x000b0000)
[ 2128.136884] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[ 2128.140891] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[ 2128.144736] sd 0:0:0:0: [sda] tag#2240 CDB: Read(16) 88 00 00 00 00 02 de 5b e2 90 00 00 08 00 00 00
[ 2128.147508] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.150063] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.152161] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.154248] mpt3sas_cm0:     handle(0x0019), ioc_status(scsi ioc terminated)(0x004b), smid(2241)
[ 2128.155981] mpt3sas_cm0:     request_len(1048576), underflow(1048576), resid(1048576)
[ 2128.157661] mpt3sas_cm0:     tag(2), transfer_count(0), sc->result(0x000b0000)
[ 2128.159342] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[ 2128.160760] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[ 2128.162180] sd 0:0:0:0: [sda] tag#2296 CDB: Read(16) 88 00 00 00 00 02 de 5b da 90 00 00 05 f8 00 00
[ 2128.163596] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.164936] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.166163] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.167378] mpt3sas_cm0:     handle(0x0019), ioc_status(unknown)(0x0003), smid(2297)
[ 2128.168587] mpt3sas_cm0:     request_len(782336), underflow(782336), resid(99320)
[ 2128.169784] mpt3sas_cm0:     tag(0), transfer_count(683016), sc->result(0x000b0000)
[ 2128.170925] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[ 2128.172079] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[ 2128.173223] sd 0:0:0:0: [sda] tag#2246 CDB: Read(16) 88 00 00 00 00 02 de 5b ee 90 00 00 04 00 00 00
[ 2128.174367] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.175514] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.176666] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.177811] mpt3sas_cm0:     handle(0x0019), ioc_status(scsi ioc terminated)(0x004b), smid(2247)
[ 2128.178956] mpt3sas_cm0:     request_len(524288), underflow(524288), resid(524288)
[ 2128.180104] mpt3sas_cm0:     tag(65535), transfer_count(0), sc->result(0x000b0000)
[ 2128.181257] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[ 2128.182403] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[ 2128.183549] sd 0:0:0:0: [sda] tag#2245 CDB: Read(16) 88 00 00 00 00 02 de 5b ea 90 00 00 04 00 00 00
[ 2128.184703] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.185853] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.186994] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.188138] mpt3sas_cm0:     handle(0x0019), ioc_status(scsi ioc terminated)(0x004b), smid(2246)
[ 2128.189283] mpt3sas_cm0:     request_len(524288), underflow(524288), resid(524288)
[ 2128.190427] mpt3sas_cm0:     tag(65535), transfer_count(0), sc->result(0x000b0000)
[ 2128.191571] mpt3sas_cm0:     scsi_status(good)(0x00), scsi_state(state terminated no status )(0x0c)
[ 2128.192727] mpt3sas_cm0: log_info(0x3112043e): originator(PL), code(0x12), sub_code(0x043e)
[ 2128.193878] mpt3sas_cm0: Device Status Change
[ 2128.195023] mpt3sas_cm0: Disable tm_busy flag for handle(0x0019)
[ 2128.196162] mpt3sas_cm0: Device Status Change
[ 2128.196168] mpt3sas_cm0: device status change: (internal device reset complete)      handle(0x0019), sas address(0x300062b202991695), tag(65535)
[ 2128.197535] mpt3sas_cm0: Disable tm_busy flag for handle(0x0019)

[ 2128.201029] mpt3sas_cm0: device status change: (internal device reset complete)      handle(0x0019), sas address(0x300062b202991695), tag(65535)
[ 2128.820155] mpt3sas_cm0: Discovery: (start)
[ 2128.823241] mpt3sas_cm0: SAS Topology Change List
[ 2128.823272] mpt3sas_cm0: discovery event: (start)
[ 2128.824771] mpt3sas_cm0: Discovery: (stop)
[ 2128.826145] mpt3sas_cm0: sas topology change: (responding)
[ 2128.827734] sd 0:0:0:0: [sda] tag#2248 CDB: Read(16) 88 00 00 00 00 02 de 5b ee 90 00 00 04 00 00 00
[ 2128.827737] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.827738] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.827740] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.827741] mpt3sas_cm0:     handle(0x0019), ioc_status(success)(0x0000), smid(2249)
[ 2128.827743] mpt3sas_cm0:     request_len(524288), underflow(524288), resid(262144)
[ 2128.827744] mpt3sas_cm0:     tag(65535), transfer_count(262144), sc->result(0x00000000)
[ 2128.827745] mpt3sas_cm0:     scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[ 2128.829130]  handle(0x0000), enclosure_handle(0x0001) start_phy(02), count(1)
[ 2128.830555] mpt3sas_cm0:     [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[ 2128.832270]  phy(02), attached_handle(0x0019): link rate change: link rate: new(0x0a), old(0x0a)
[ 2128.833698] sd 0:0:0:0: [sda] tag#2248 CDB: Read(16) 88 00 00 00 00 02 de 5b ee 90 00 00 04 00 00 00
[ 2128.835189] mpt3sas_cm0: updating handles for sas_host(0x500062b202991693)
[ 2128.836617] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.847671] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.848871] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.850066] mpt3sas_cm0:     handle(0x0019), ioc_status(success)(0x0000), smid(2249)
[ 2128.851273] mpt3sas_cm0:     request_len(524288), underflow(524288), resid(262144)
[ 2128.852469] mpt3sas_cm0:     tag(65535), transfer_count(262144), sc->result(0x00000002)
[ 2128.853666] mpt3sas_cm0:     scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[ 2128.854856] mpt3sas_cm0:     [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[ 2128.856047] sd 0:0:0:0: Power-on or device reset occurred
[ 2128.857760] mpt3sas_cm0: discovery event: (stop)
[ 2128.886501] sd 0:0:0:0: [sda] tag#2317 CDB: ATA command pass through(12)/Blank a1 08 2e 00 01 00 00 00 00 ec 00 00
[ 2128.890755] mpt3sas_cm0:     sas_address(0x300062b202991695), phy(2)
[ 2128.894975] mpt3sas_cm0: enclosure logical id(0x500062b202991693), slot(0)
[ 2128.898648] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[ 2128.901549] mpt3sas_cm0:     handle(0x0019), ioc_status(success)(0x0000), smid(2318)
[ 2128.903707] mpt3sas_cm0:     request_len(512), underflow(0), resid(0)
[ 2128.904900] mpt3sas_cm0:     tag(0), transfer_count(512), sc->result(0x00000002)
[ 2128.906087] mpt3sas_cm0:     scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[ 2128.907270] mpt3sas_cm0:     [sense_key,asc,ascq]: [0x01,0x00,0x1d], count(22)

Someone please explain to me how the 0x0a and 0x0a in link rate change: link rate: new(0x0a), old(0x0a) are two different numbers lol
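The sc->result values and sense triplets in the reset log above are less cryptic once split into their fields. A sketch using the host byte and status codes from the Linux SCSI layer as I know them; the tables are deliberately partial and the function names are made up:

```python
# Host byte (bits 23:16 of sc->result), partial table.
HOST_BYTES = {
    0x00: "DID_OK",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x0B: "DID_SOFT_ERROR",
}
# SCSI status byte (bits 7:0 of sc->result), partial table.
STATUS_BYTES = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
# (sense key, asc, ascq) triplets that show up in the logs above.
SENSE = {
    (0x06, 0x29, 0x00): "UNIT ATTENTION: power on, reset, or bus device reset occurred",
    (0x01, 0x00, 0x1D): "RECOVERED ERROR: ATA pass-through information available",
}

def decode_result(result: int) -> tuple:
    """Split sc->result into (host byte name, status byte name)."""
    host = (result >> 16) & 0xFF
    status = result & 0xFF
    return (
        HOST_BYTES.get(host, "host byte " + hex(host)),
        STATUS_BYTES.get(status, "status " + hex(status)),
    )

def decode_sense(key: int, asc: int, ascq: int) -> str:
    return SENSE.get((key, asc, ascq), "unknown sense data")

print(decode_result(0x000B0000))
print(decode_result(0x00000002))
print(decode_sense(0x06, 0x29, 0x00))
```

The terminated reads come back as DID_SOFT_ERROR, which as far as I understand the SCSI midlayer treats as retryable. That would fit with ZFS never reporting an error, though that connection is an inference and not something the logs prove.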

Conclusion

Here’s my final setup:

This server’s state went from “everything works but it’s a power hog” to “machine exhibits strange behaviour tolerated by the operating system but at least we’re saving power”. I’m not entirely sure how I’m supposed to feel about that. Part of me would like to go out and buy a 9600-24i for upwards of 1300€ after seeing one of the Tri-Mode MegaRAID controllers get to L1 on its own, but I really don’t want to spend that sort of cash.

For now I guess this is fine. I can live with my server the way it is now. Should things ever get worse I will either update this post or take it down entirely. It’s not like anyone would notice lol

If you’ve made it this far and still want to go ahead with all of this, by all means do it. You clearly care just as little about your production environment as I do.