Benchmarking Virtual Screening: What Enrichment Rates Actually Tell You

2024-09-11 Tomás Vidal — Computational Chemistry Lead 9 min read

An AUC-ROC of 0.8 on a DUD-E benchmark does not mean your virtual screen will find three kinase hits in a library of ten thousand compounds. What enrichment metrics measure, what they miss, and how to set realistic expectations before a campaign starts.

The Benchmark-to-Campaign Translation Problem

Virtual screening benchmarks exist because it is difficult to compare methods directly on prospective campaigns: you rarely know the ground truth hit rate of a real library against a real target before you run the experiment. DUD-E (Directory of Useful Decoys, Enhanced) and LIT-PCBA (a benchmark built from PubChem BioAssay data) are two of the most widely used retrospective benchmark sets in the field. Both assess how well a virtual screening method can distinguish known actives from non-binders (property-matched decoys in DUD-E, experimentally confirmed inactives in LIT-PCBA) when the method is blind to their identity.

AUC-ROC (area under the receiver operating characteristic curve) and EF (enrichment factor) at 1% or 5% of the screened library are the standard metrics. An AUC-ROC of 0.8 means that a randomly chosen active is ranked above a randomly chosen decoy 80% of the time; the method is substantially better than random (0.5) at separating actives from decoys. An EF1% of 10 means that the top 1% of the ranked list contains 10 times the concentration of actives that a random selection from the same library would contain.
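Both definitions are short enough to compute by hand. The following is a minimal sketch, assuming the screen has already been reduced to a list of 1/0 labels (active/decoy) sorted best score first with no ties; the function names and the toy list are illustrative, not taken from any particular package.

```python
from typing import Sequence


def enrichment_factor(labels_ranked: Sequence[int], fraction: float = 0.01) -> float:
    """EF at `fraction`: active rate in the top slice divided by the overall active rate.

    `labels_ranked` holds 1 (active) or 0 (decoy), best-scored compound first.
    """
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, round(n_total * fraction))  # size of the top slice
    hits_top = sum(labels_ranked[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), rearranged to one division
    return hits_top * n_total / (n_top * n_actives)


def auc_roc(labels_ranked: Sequence[int]) -> float:
    """AUC-ROC as the fraction of (active, decoy) pairs ranked in the right order.

    Assumes a strict ranking with no tied scores.
    """
    n_actives = sum(labels_ranked)
    n_decoys = len(labels_ranked) - n_actives
    if n_actives == 0 or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")
    decoys_seen = 0
    pairs_wrong = 0  # pairs in which a decoy outranks an active
    for label in labels_ranked:
        if label:
            pairs_wrong += decoys_seen
        else:
            decoys_seen += 1
    return 1.0 - pairs_wrong / (n_actives * n_decoys)


# Toy ranked list: 1,000 compounds, 10 actives, only 1 of them in the top 10.
ranked = [1] + [0] * 9 + [1] * 9 + [0] * 981
print(enrichment_factor(ranked, 0.01))  # 10.0
print(auc_roc(ranked))
```

In the toy list the AUC is above 0.99 while EF1% is 10 out of a possible 100, which illustrates that the two metrics answer different questions about the same ranking.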

These metrics are useful for method comparison on benchmark sets. They are not reliable predictors of absolute hit rate in a prospective campaign, for reasons that are structural and physicochemical rather than methodological. Understanding those reasons is necessary for setting realistic expectations before a campaign, and for interpreting performance during it.

Why DUD-E AUC Does Not Predict Prospective Hit Rate

DUD-E decoys are selected to match the physicochemical properties of the actives (MW, clogP, rotatable bonds, H-bond donor/acceptor counts) while being topologically dissimilar by Tanimoto distance. This controls for the obvious confound: methods that simply rank by molecular weight or clogP would otherwise score well against randomly selected decoys. But property-matching introduces a different problem: the decoys are physicochemically similar to the actives but are only assumed to be non-binders. This assumption is imperfect. The decoys are untested, so a non-trivial fraction may in fact be binders, which makes the labels noisy. Published analyses have also shown that the enforced topological dissimilarity lets a method separate actives from decoys on chemotype alone, which artificially inflates AUC values.

LIT-PCBA was designed to address DUD-E's artificial enrichment by using actives and decoys from actual biochemical assay data (PubChem BioAssay), where non-binders are experimentally confirmed as inactive rather than assumed by topological distance. LIT-PCBA AUC values are consistently lower than DUD-E values for the same methods — typically by 0.05-0.15 AUC units depending on target class. This is not because the methods got worse; it is because the benchmark is harder and more realistic.

Even LIT-PCBA AUC does not translate directly to prospective hit rate because the benchmark actives are drawn from diverse public compound databases and include chemotypes that may not appear in your specific commercial library. A method that achieves EF1% of 8 on the LIT-PCBA CDK2 set may achieve EF1% of 3-5 against a commercial fragment library where the CDK-active chemotypes are underrepresented by library design. The method did not fail; the library composition changed the retrieval probability.

What Enrichment Factor Actually Measures

EF at the 1% fraction of the screened library is the most practically relevant metric because it corresponds roughly to the throughput of experimental follow-up: if you are willing to test 100-200 compounds from a 10,000-compound library, you want to know how concentrated the actives are in the top 1-2% of the ranked list. For a 10,000-compound library with a 0.5% base rate of actives, an EF1% of 10 means the top 1% of the list contains 100 compounds, of which about 5% (roughly 5 compounds) might be confirmed actives in an orthogonal assay. That projection assumes the base rate and EF hold prospectively, which they may not.

The base rate assumption is the most dangerous assumption in enrichment projections. Published hit rates for fragment libraries against well-characterized binding sites vary widely: 1-5% for good-quality fragment sets against kinase hinge-binding — confirmed by SPR or thermal shift at standard screening conditions — down to 0.1-0.5% for the same library against GPCR or PPI targets where binding is inherently weaker and harder to confirm. Applying a kinase-calibrated EF1% to a GPCR campaign without adjusting for the lower base rate produces optimistic hit count projections that the experimental results will not support.
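The sensitivity to base rate is easy to see by running the projection arithmetic with different inputs. This is a rough sketch, not a forecasting tool: `project_hits` is a hypothetical helper, and the two base rates are simply the 0.5% figure from the example above and the 0.1% lower bound quoted for harder target classes.

```python
def project_hits(library_size: int, base_rate: float, ef: float,
                 fraction: float = 0.01) -> tuple[int, float]:
    """Return (compounds selected, expected actives among them).

    Assumes the benchmark EF and the estimated base rate both hold
    prospectively, which is the assumption this section warns about.
    """
    n_selected = round(library_size * fraction)
    # EF cannot exceed 1/fraction: the selection cannot hold more actives
    # than the whole library contains.
    ef_usable = min(ef, 1.0 / fraction)
    # The hit rate inside the selection cannot exceed 100%.
    hit_rate_top = min(1.0, ef_usable * base_rate)
    return n_selected, n_selected * hit_rate_top


# Same EF1% of 10 and the same 10,000-compound library, different base rates.
for label, base_rate in [("0.5% base rate", 0.005), ("0.1% base rate", 0.001)]:
    n_selected, expected = project_hits(10_000, base_rate, ef=10)
    print(f"{label}: test {n_selected} compounds, expect ~{expected:.0f} actives")
```

With identical enrichment, the projected yield from 100 tested compounds drops from about five actives to about one, purely because of the denominator.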

BEDROC (Boltzmann-Enhanced Discrimination of ROC) weights early recovery more heavily than late recovery, making it more sensitive to enrichment at the small top fraction that matters for practical hit identification. With α=80.5, about 80% of the score comes from the top 2% of the ranked list; α=160.9 gives the same weighting to the top 1%. It is more discriminating than AUC-ROC for evaluating prospective utility but less widely reported in the literature, which makes benchmark comparisons across publications harder.
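For readers who want to see the weighting explicitly, here is a minimal sketch of BEDROC in the Truchon and Bayly formulation: an exponentially weighted sum over the ranks of the actives (RIE), rescaled between its worst and best possible values. The function name and the input convention (1/0 labels, best score first) are assumptions made for this example.

```python
import math
from typing import Sequence


def bedroc(labels_ranked: Sequence[int], alpha: float = 80.5) -> float:
    """BEDROC for a ranked list of 1 (active) / 0 (decoy), best score first."""
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one decoy")
    ratio = n_actives / n_total

    # Exponentially weighted sum over the 1-based ranks of the actives.
    weighted = sum(math.exp(-alpha * rank / n_total)
                   for rank, label in enumerate(labels_ranked, start=1) if label)
    # Expected weight of a single active placed uniformly at random.
    random_mean = (1 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1))
    rie = weighted / (n_actives * random_mean)

    # Rescale RIE between its worst-case and best-case values to land in [0, 1].
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Two toy rankings of 10 actives in 1,000 compounds: actives near the top
# versus actives spread through the first half of the list.
early = [1] * 5 + [0] * 20 + [1] * 5 + [0] * 970
spread = ([1] + [0] * 49) * 10 + [0] * 500
print(bedroc(early), bedroc(spread))
```

The exponential term is why actives recovered deep in the list contribute almost nothing: at α=80.5 the weight of a compound halfway down the ranking is smaller than the weight of the top-ranked compound by a factor of roughly e^40.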

Protocol Validation: What Numbers Are Realistic

Before running a prospective virtual screen, protocol validation on a target-specific benchmark is the correct standard. This requires: (1) a set of known actives for the target with confirmed binding, drawn from ChEMBL or internal data; (2) a property-matched decoy set (for example from the DUD-E decoy generator); (3) running the docking protocol under prospective conditions (same receptor preparation, same docking parameters, same scoring function) on the validation set.

Target-specific validation typically produces AUC values 0.05-0.1 lower than the published benchmark values for the same method, because the target-specific actives are more chemically diverse than the curated benchmark actives and include more ambiguous binders. This calibration is valuable: it tells you what the method actually achieves on your target with your receptor structure, not what it achieves on a published benchmark against a different set of actives.

In our internal validation studies on kinase and GPCR targets, Glide SP with the OPLS4 force field achieves EF1% values in the range of 5-12 on well-characterized kinase targets with crystal structures at 2.5 Å resolution or better, and 3-7 on GPCR targets where the structural quality is adequate for grid definition. AutoDock Vina with explicit receptor flexibility (flexible side-chain docking on key pocket residues) shows comparable or slightly lower EF1% on the same targets with substantially faster computation. These ranges are consistent with published benchmarks on similar target classes. We're not claiming these numbers as guarantees for any specific campaign; they are calibration ranges that inform expectation-setting, not performance warranties.

The Decoy Problem in Real Libraries

Real commercial screening libraries contain compound classes that are not well-represented in DUD-E or LIT-PCBA decoy sets. Pan-assay interference compounds (PAINS), including thiol-reactive electrophiles, redox-cycling quinones, aggregators, and fluorescent compounds, can score well in docking simply because they are large, lipophilic, and fill hydrophobic pockets. Standard docking scoring functions do not penalize these compound classes; PAINS filters must be applied as a pre- or post-processing step. The REOS (Rapid Elimination of Swill) filters of Walters and Murcko and the Baell-Holloway PAINS substructure rules, the latter available in RDKit, provide a practical implementation.
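As an illustration of the filtering step, the sketch below splits a library of SMILES strings into kept and flagged compounds using the PAINS catalog bundled with RDKit's `FilterCatalog` module. The helper names are hypothetical, and flagging unparseable SMILES for review rather than silently dropping them is a design choice made for this example.

```python
from typing import Callable, Iterable


def make_pains_flagger() -> Callable[[str], bool]:
    """Build a SMILES -> bool predicate from RDKit's bundled PAINS catalog."""
    # Imported lazily so the partition helper below also works without RDKit.
    from rdkit import Chem
    from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
    catalog = FilterCatalog(params)

    def is_flagged(smiles: str) -> bool:
        mol = Chem.MolFromSmiles(smiles)
        # Unparseable structures are flagged so they get reviewed, not docked.
        return mol is None or catalog.HasMatch(mol)

    return is_flagged


def partition_library(smiles_list: Iterable[str],
                      is_flagged: Callable[[str], bool]) -> tuple[list[str], list[str]]:
    """Split a library into (kept, flagged) using the supplied predicate."""
    kept: list[str] = []
    flagged: list[str] = []
    for smiles in smiles_list:
        (flagged if is_flagged(smiles) else kept).append(smiles)
    return kept, flagged
```

With RDKit installed, `kept, flagged = partition_library(library_smiles, make_pains_flagger())` applies the filter, where `library_smiles` stands in for your own list. Passing the predicate in as an argument makes it straightforward to swap in a different rule set, or to run the same split before docking and again on the top-ranked list.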

Aggregators present a subtler problem. Compounds that form colloidal aggregates at typical screening concentrations show apparently broad-spectrum activity in biochemical assays and may produce satisfactory docking scores because of their lipophilicity. Aggregate inhibitors are not detected by docking and are not excluded by property filters (they pass Lipinski and Ro3). The only reliable filter is the detergent counter-screen (0.01% Triton X-100 or equivalent), which identifies compounds whose apparent IC50 shifts significantly with detergent — a classic aggregate inhibitor signature. This counter-screen is cheap and should be run on every confirmed hit from a docking-based virtual screen before structural follow-up is initiated.

We're not saying that PAINS and aggregators undermine virtual screening as a strategy. They are known failure modes with known counter-measures. The point is that enrichment rate metrics from published benchmarks are measured on curated active sets that are, by definition, genuine binders — the benchmark does not include aggregators or PAINS compounds in the active set. A prospective virtual screen against a real library will encounter these compound classes, and the apparent enrichment in the top-ranked list before orthogonal counter-screening will be lower than the benchmark numbers suggest.

Setting Realistic Expectations: A Pre-Campaign Checklist

Before a virtual screening campaign, the following parameters should be explicitly stated, not assumed:

Library composition and base rate estimate: What fraction of the library is fragment-sized (Ro3) versus drug-like (Ro5)? What fraction passes PAINS filters? Published base rates for the target class (kinase, GPCR, protease, PPI) define the denominator for enrichment projections.
Receptor structure quality: What is the resolution of the input crystal structure? Are there missing loops that affect pocket geometry? Has the protein been prepared at appropriate pH with correct protonation and disulfide bond states? PrepWizard protein preparation and PROPKA protonation are the standard preprocessing steps; skipping either degrades docking accuracy by 0.1-0.3 AUC in published comparisons.
Protocol validation EF1%: Run the docking protocol on a target-specific validation set before screening the full library. This calibrates the enrichment expectation on this specific target, not on the benchmark set.
Orthogonal confirmation plan: What is the capacity for experimental follow-up? SPR and thermal shift can confirm 200-500 compounds per week at modest throughput. The top 1% of a 10,000-compound screen is 100 compounds — fully tractable. The top 1% of a 200,000-compound screen is 2,000 compounds — not tractable without additional computational triage.

For how enrichment rate benchmarking feeds into our campaign design and hit triage decisions, see the Science page. For the structural screening platform that implements these workflows, see the Moleculepath Screening Platform. Related reading: Docking Scores vs FEP addresses how to triage the confirmed hit shortlist once enrichment-based virtual screening has produced it.