Reliable projections of extremes by climate models are becoming increasingly important in the context of climate change and its societal impacts. Extremes are by definition rare events, so the available samples are small and the associated uncertainties large. The evaluation of extreme events in model simulations thus requires performance measures that compare full distributions rather than simple summaries. This paper proposes the use of the integrated quadratic distance (IQD) for this purpose. The IQD is applied to evaluate CMIP5 and CMIP6 simulations of monthly maximum and minimum near-surface air temperature over Europe and North America against both observation-based data and reanalyses. Several climate models perform well, in the sense that their performance is competitive with that of another data product in reproducing the evaluation set. While the model rankings vary with region, season and index, the model evaluation is robust to changes in the grid resolution used in the analysis. When the model simulations are ranked by their similarity to the ERA5 reanalysis, more CMIP6 than CMIP5 models appear at the top of the ranking. When evaluated against the HadEX2 data product, the overall performance of the two model ensembles is similar.
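
To illustrate the kind of distribution-based comparison described above, the following is a minimal sketch of the IQD between two empirical distributions, assuming the standard definition IQD(F, G) = ∫ (F(x) − G(x))² dx with F and G taken as the empirical CDFs of two samples. The function names `iqd` and `rank_by_iqd` and the dictionary layout of the candidates are hypothetical choices made for illustration; the evaluation in the paper itself may differ in detail, for example in how scores are aggregated over grid cells.

```python
from bisect import bisect_right


def iqd(sample_a, sample_b):
    """Integrated quadratic distance between the empirical CDFs of two samples.

    Assumes IQD(F, G) = integral of (F(x) - G(x))**2 dx. Both empirical CDFs
    are step functions, so the integral is an exact sum over the intervals
    between consecutive points of the pooled sample.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    grid = sorted(a + b)  # pooled sample: all points where either CDF jumps
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        f = bisect_right(a, left) / len(a)  # value of F on [left, right)
        g = bisect_right(b, left) / len(b)  # value of G on [left, right)
        total += (f - g) ** 2 * (right - left)
    return total


def rank_by_iqd(reference, candidates):
    """Rank candidate samples by IQD to a reference sample (smallest first).

    `candidates` is a hypothetical mapping from a model name to its sample.
    """
    scores = {name: iqd(values, reference) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

A lower IQD indicates closer agreement between the two distributions, and identical samples give a value of zero. Because the score compares whole distributions, two samples with the same mean but different spread still receive a non-zero distance.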