IMRT planning with commercial treatment planning systems (TPSs) is a trial-and-error process. Consequently, the quality of treatment plans may not be consistent among patients, planners and institutions. Recently, different plan quality assurance (QA) models have been proposed that could flag suboptimal treatment plans and guide their improvement. However, the performance of these models was validated using plans created with the conventional trial-and-error treatment planning process, so the reference plans themselves may have been suboptimal. Consequently, it is challenging to assess and compare quantitatively the accuracy of different treatment planning QA models. Therefore, we created a gold-standard dataset of consistently planned Pareto-optimal IMRT plans for 115 prostate patients. Next, the dataset was used to assess the performance of a treatment planning QA model that uses the overlap volume histogram (OVH). All 115 prostate IMRT plans were generated fully automatically using our in-house developed TPS, Erasmus-iCycle. An existing OVH model was trained on the plans of 58 of the patients. Next, it was applied to predict the DVHs of the rectum, bladder and anus of the remaining 57 patients. The predictions were compared with the achieved values of the gold-standard plans for the rectum Dmean, V65 and V75, and for Dmean of the anus and the bladder. For the rectum, the prediction errors (predicted minus achieved) were only -0.2 ± 0.9 Gy (mean ± 1 SD) for Dmean, -1.0 ± 1.6% for V65, and -0.4 ± 1.1% for V75. For Dmean of the anus and the bladder, the prediction errors were 0.1 ± 1.6 Gy and 4.8 ± 4.1 Gy, respectively. Increasing the training cohort to 114 patients led to only minor improvements. A dataset of consistently planned Pareto-optimal prostate IMRT plans was generated. This dataset can be used to train new treatment planning QA models and to validate and compare existing ones, and it has been made publicly available. The OVH model was highly accurate in predicting rectum and anus DVHs. For the bladder, larger prediction errors were observed.
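The prediction errors reported above are signed differences (predicted minus achieved), summarized as mean ± 1 SD over the validation patients. A minimal sketch of that summary follows. The function name `prediction_error_summary` and the input values are hypothetical and purely illustrative (they are not taken from the dataset), and the use of the sample standard deviation is an assumption, since the abstract does not state which SD estimator was used.

```python
from statistics import mean, stdev


def prediction_error_summary(predicted, achieved):
    """Return (mean, SD) of the signed errors predicted - achieved.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the abstract does not specify the estimator.
    """
    if len(predicted) != len(achieved):
        raise ValueError("predicted and achieved must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two patients are needed to compute an SD")
    errors = [p - a for p, a in zip(predicted, achieved)]
    return mean(errors), stdev(errors)


# Made-up example values in Gy for four hypothetical patients.
predicted_dmean = [30.0, 28.5, 35.0, 31.0]
achieved_dmean = [30.5, 28.0, 35.5, 31.5]

mean_error, sd_error = prediction_error_summary(predicted_dmean, achieved_dmean)
print(f"Dmean prediction error: {mean_error:.2f} +/- {sd_error:.2f} Gy")
```

With this sign convention, a negative mean error indicates that the model on average predicted a lower value than the plan achieved, and a positive mean error (as for the bladder Dmean) indicates that it predicted a higher one.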