Battery fires are rare, but when one happens it makes the news. Because there have been so few, most of what the public knows about battery energy storage system (BESS) failures is anecdotal. EPRI has been quietly changing that. Its Battery Energy Storage Systems Failure Incident Database, now past 81 logged incidents and 26 fully classified root-cause investigations, is the best public dataset on where, how, and why grid-scale batteries actually fail. Two headlines from the most recent release: the failure rate, normalised against installed capacity, dropped by 97% between 2018 and 2023, and most of the remaining failures do not start in the cells.
The trajectory
EPRI’s failure-rate tracking, normalised against installed capacity, shows the rate falling from roughly 10 incidents per GW in 2018 to well under 1 per GW by 2023. The absolute count of incidents has risen modestly as the installed fleet has grown, but the intensity per unit of installed capacity has collapsed by more than an order of magnitude: a 97% drop is roughly a 33-fold reduction. That is not a story about cells becoming fundamentally safer overnight. It is a story about industrialisation: better commissioning procedures, better fire-suppression design, standardised testing, and a rapid shift to LFP chemistry across grid-scale procurement.
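As a rough check on the arithmetic, the short Python sketch below converts a percentage drop into a fold reduction and into orders of magnitude. The 97% figure and the starting rate of roughly 10 incidents per GW are taken from the text above; the function name and the implied 2023 value are illustrative, not EPRI-published numbers.

```python
import math


def reduction_summary(pct_drop: float) -> tuple[float, float]:
    """Return (fold_reduction, orders_of_magnitude) for a percentage drop."""
    remaining = 1.0 - pct_drop / 100.0   # fraction of the original rate left
    fold = 1.0 / remaining               # how many times lower the new rate is
    return fold, math.log10(fold)


fold, orders = reduction_summary(97.0)

# Assumption: start from the article's "roughly 10 incidents per GW" in 2018.
implied_2023 = 10.0 * (1.0 - 97.0 / 100.0)

print(f"{fold:.1f}x lower, about {orders:.2f} orders of magnitude")
print(f"implied 2023 rate: about {implied_2023:.2f} incidents per GW")
```

A 97% drop works out to about a 33-fold reduction, or roughly 1.5 orders of magnitude, which is consistent with a rate of "well under 1 per GW" by 2023.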
When failures actually happen
The single most surprising finding in the EPRI dataset is temporal: 72% of failures occur during construction, commissioning, or within the first two years of operation. Mid-life failures, from year 3 to year 10, are rare. Late-life knee-point degradation is typically managed through operational policy rather than ending in a catastrophic event. The implication is that the riskiest period for a BESS is not the end of its design life but the beginning. The quality of the engineering, procurement and construction (EPC) work, commissioning rigour, and early-life operational procedures are the highest-leverage interventions for fleet safety.
Where failures come from
Of the 26 failures classified in detail, the root causes do not cluster where most public commentary assumes. Only 11% were attributed to cell-level defects. Another 26% came from issues in the fire-suppression or fire-protection system, 18% from thermal management (cooling, HVAC, thermal runaway propagation), and the remaining 45% from a combination of balance-of-system (BOS) components, energy management system (EMS) firmware, installation errors, and site-level integration issues. The pattern is consistent across North American and European incidents in the database: modern cells, well selected and operated inside their envelope, rarely initiate failures on their own. What turns a contained cell event into a system-level incident is the response system around it.
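The breakdown above can be tallied in a few lines of Python. This is only a bookkeeping sketch: the three named percentages come from the text, the dictionary keys are shorthand labels chosen here, and the sketch assumes the categories do not overlap.

```python
# Root-cause shares (percent of classified failures) quoted in the text.
named_shares = {
    "cell-level defects": 11,
    "fire suppression / protection": 26,
    "thermal management": 18,
}

# Assumption: categories are mutually exclusive, so the rest is the residual.
remainder = 100 - sum(named_shares.values())

# Everything that was not attributed to the cells themselves.
non_cell = 100 - named_shares["cell-level defects"]

for cause, pct in named_shares.items():
    print(f"{cause:32s} {pct:3d}%")
print(f"{'BOS, EMS, installation, site':32s} {remainder:3d}%")
print(f"{'all non-cell causes':32s} {non_cell:3d}%")
```

On these figures, 45% of classified failures fall into the residual group, and 89% originate somewhere other than the cells.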
What operators actually do differently now
Four changes have tracked the failure-rate collapse. First, large-format prismatic LFP has replaced most NMC in new-build grid-scale systems, raising the thermal-runaway onset temperature from 150–200°C to roughly 270°C and reducing flammable-gas release per cell. Second, compliance with the NFPA 855 installation standard and testing to UL 9540A are now effectively industry-standard; the UL 9540A tests specifically stress propagation between modules and are used to tune enclosure venting. Third, multi-stage fire detection has moved from optional to default in European insurer requirements: thermal-runaway precursor detection comes first, then gas sensing, then smoke detection. Fourth, commissioning procedures have been formalised: 72 hours of monitored continuous operation under representative duty cycles before handover is now a common EPC contractual milestone.
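To make the 72-hour commissioning milestone concrete, here is a minimal Python sketch of how an acceptance check on telemetry timestamps might look. Everything in it is an assumption for illustration: the function names, the 5-minute tolerance for gaps in monitoring, and the idea that a gap in logged samples counts as an interruption. It does not assess whether the duty cycle was representative, and real contractual criteria will differ.

```python
from datetime import datetime, timedelta


def longest_continuous_run(timestamps, max_gap=timedelta(minutes=5)):
    """Longest span of telemetry with no gap between samples above max_gap."""
    if not timestamps:
        return timedelta(0)
    ts = sorted(timestamps)
    best = timedelta(0)
    run_start = ts[0]
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > max_gap:          # monitoring gap: close the current run
            best = max(best, prev - run_start)
            run_start = cur
    return max(best, ts[-1] - run_start)  # include the final run


def passes_soak(timestamps, required=timedelta(hours=72),
                max_gap=timedelta(minutes=5)):
    """True if some uninterrupted run of samples covers the required duration."""
    return longest_continuous_run(timestamps, max_gap) >= required


# Hypothetical data: one sample per minute for 72 hours.
start = datetime(2025, 1, 1)
clean = [start + timedelta(minutes=i) for i in range(72 * 60 + 1)]

# Same log with a 30-minute monitoring outage beginning at hour 40.
gap_from = start + timedelta(hours=40)
gap_to = gap_from + timedelta(minutes=30)
interrupted = [t for t in clean if not (gap_from < t < gap_to)]

print(passes_soak(clean))
print(passes_soak(interrupted))
```

The design choice worth noting is that the check looks for the longest uninterrupted run rather than total logged hours, so an outage restarts the clock instead of merely being subtracted.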
What the database still cannot tell us
The EPRI dataset covers incidents that were reported and investigated. There is a plausible under-reporting bias for smaller thermal events that did not escalate, and a larger one for operational issues (capacity fade beyond specification, repeated parasitic-load excursions) that do not rise to “incident” status. The granular early-warning dataset - what happens before a cell hits thermal runaway - lives mostly in proprietary telemetry and has not been pooled. Several European TSOs and regulators are discussing a common reporting template in 2025–2026, which would materially improve the shared picture.
What it means for procurement
The EPRI database is the cleanest empirical answer to the question insurers and lenders ask: is this asset class getting safer? The answer is yes, decisively, and the improvements trace to specific engineering and procedural changes. For a developer, the practical takeaway is that the cheapest safety intervention available is EPC quality through construction, commissioning, and early operation, the window in which 72% of failures occur. That is where procurement, warranty language, and commissioning acceptance procedures deserve the most attention. The cells will mostly be fine; the question is whether everything around them is.