Docking Scores vs FEP: Choosing the Right Method for Early Hit Triage

Docking is fast; FEP is accurate. Neither is universally correct for hit triage. The choice depends on what you know about the target binding mode and how much structural uncertainty you can tolerate at the hit stage.

The Fundamental Trade-off

Docking and free energy perturbation (FEP) address the same underlying question — how strongly does this compound bind to this target? — using physics at different levels of approximation. Docking uses an empirical or force-field-derived scoring function to evaluate static or semi-flexible binding poses, typically in seconds per compound. FEP+ and related alchemical methods simulate the thermodynamic cycle of ligand transformation explicitly, sampling solvent structure, protein flexibility, and conformational entropy over nanosecond-scale MD trajectories. The accuracy difference is real: published FEP+ benchmark data on congeneric series (Wang et al., JCTC 2015; Cournia et al., JCIM 2020) shows MUE values of 0.8-1.0 kcal/mol for well-parameterized systems. Docking rank-ordering on the same sets typically shows correlation coefficients to experimental Ki in the range of 0.4-0.6 at best.

What is less often discussed is that this accuracy advantage is conditional. FEP+ performs within 1 kcal/mol when: the binding mode is known (from co-crystal or high-confidence docking), the structural change between perturbation endpoints involves a single substituent replacement with defined geometry, the protein conformation is well-sampled by the starting MD, and the force field parameters (OPLS4 or ff14SB) are appropriate for the chemotype. When any of these conditions are not met, FEP errors can exceed 2-3 kcal/mol — comparable to or worse than well-validated docking on the same system.

When Docking Is the Right Tool

In early hit identification — the phase most SBDD programs spend the most money on — docking is typically the correct starting method, not because it is more accurate but because the conditions for FEP accuracy are not yet met. The binding mode of the hit is unknown. The binding site may have conformational flexibility that the input structure does not capture. The hit set is structurally diverse, not a congeneric series. Running FEP on 1,000 diverse compounds from a virtual screen requires not only compute time but reliable force field parameterization for each chemotype — a non-trivial effort for structures containing exotic heterocycles, unusual charge distributions, or tautomeric ambiguity.

Glide SP docking can process a 100,000-compound library against a single target structure in a few hours on a modest CPU cluster. AutoDock Vina or GNINA can do the same in less time with GPU acceleration. The output is a ranked list with binding poses. For each pose, the question is not "is this ΔG value accurate?" — it isn't — but "does this pose make structural sense?" Pose quality review — visual inspection for productive H-bond contacts, absence of steric clashes, sensible placement of polar groups — is the human judgment layer that the scoring function cannot substitute for. Experienced computational chemists reviewing docking poses are more valuable than a refined scoring function applied blindly.

Docking is also the correct tool when the structural uncertainty is high. Homology models with less than 60% sequence identity to the template, cryo-EM structures resolved below 2.5 Å, or crystal structures with poorly defined loop regions all introduce coordinate uncertainty that is invisible to both docking and FEP. In these cases, the additional accuracy of FEP provides false precision — the noise floor imposed by coordinate uncertainty dominates over the signal from improved sampling.

When FEP Becomes Justified

FEP is justified when the following conditions converge: you have a confirmed hit with a co-crystal structure establishing binding mode, you are making series-specific substituent changes (halogenation, methyl group introduction, ring constraint), and the binding site has well-defined residue contacts without major loop flexibility. This situation describes the hit-to-lead and lead optimization phases, not the hit identification phase.

The classic use case is Matched Molecular Pair (MMP) analysis driving a tight congeneric series. Consider a CDK2 inhibitor series where the lead compound is co-crystallized at 1.8 Å resolution (PDB 1H1S and related depositions provide the structural context for this target class). FEP+ on a 20-compound matrix of methyl, fluoro, and chloro substitutions at three ring positions can predict the rank order of binding affinities with MUE ~0.8-1.0 kcal/mol, prioritizing which analogs to synthesize. This is where the compute investment pays back: synthetic resources are directed toward the highest-predicted binders, not spent uniformly across a full combinatorial matrix.

Lambda scheduling matters here. For well-converged FEP runs, 12-16 lambda windows with 2-5 ns per window per perturbation endpoint typically gives converged ΔΔG values for single heavy-atom changes. The Schrödinger FEP+ workflow with replica exchange solute tempering (REST2) improves convergence for systems with slow conformational dynamics at the binding site. RBFE (relative binding free energy) is far more tractable than ABFE (absolute) for series work — absolute FEP errors for binding free energies typically run 1-3 kcal/mol on well-behaved systems, making ABFE less useful for rank ordering than RBFE on congeneric substituents.

The Structural Confidence Requirement

There is a specific failure mode we observe repeatedly in programs that deploy FEP too early: the binding mode assumption is wrong. FEP calculates the perturbation free energy within the framework of a single binding mode. If the compound actually binds in two modes — a common situation for flexible, drug-like compounds in the 400-500 Da range — FEP gives a precise answer for the wrong binding mode, and the predicted ΔΔG values are meaningless regardless of simulation convergence.

We're not saying FEP is fragile — on the right systems, with well-established binding mode assignment, it is the most reliable affinity prediction method available to a medicinal chemistry program. The point is that the method's accuracy is gated by prior certainty about the binding mode, and that certainty is not free. It requires either a co-crystal structure of the hit itself or of a close analog with confirmed structural equivalence. Using docking pose prediction as a substitute for crystallographic binding mode confirmation before deploying FEP is a workflow sequencing error that is more common than it should be.

One scenario we see in practice: a fragment-to-lead program where fragments have been confirmed by crystallography, but the first growth vectors produce drug-like analogs in the 350-450 Da range that have not been crystallized. The temptation is to run FEP using the docked pose of the grown compound. This can work if the grown compound's pose is tightly constrained by the fragment binding mode — if the growth adds a substituent without perturbing the scaffold position. It is riskier when the growth event causes a conformational shift that changes which residue contacts dominate, something docking may or may not predict correctly depending on pocket flexibility.

Practical Decision Framework for Hit Triage

A decision framework that works in practice looks like this:

  • Virtual screening (thousands to hundreds of thousands of compounds): Glide SP or AutoDock Vina with pose quality review. No FEP. Throughput is the priority; accuracy ceiling is acceptable at this stage because the goal is enrichment, not precision ranking.
  • Hit confirmation and initial triage (dozens of confirmed binders): Glide XP re-docking of confirmed binders with explicit pose review. MM-GBSA rescoring for top poses provides a correction for desolvation that improves rank ordering compared to raw GScore without the computational cost of FEP.
  • Series-stage hit-to-lead (co-crystal confirmed, congeneric analogs): FEP+ on substituent matrix with ≥12 lambda windows, convergence monitoring by hysteresis check (forward and reverse perturbation agreement). Apply only to analogs within the established binding mode frame.
  • Lead optimization: FEP+ with extended sampling for challenging substituents; RBFE with REST2 for systems with identified conformational flexibility hotspots (e.g., P-loop kinases, flexible glycine-rich loops).

The decision to escalate to FEP should be triggered by crystallographic evidence, not by wanting more accuracy than docking provides at a stage where that accuracy is not recoverable from the structural data available. Spending FEP compute budget on a hit whose binding mode has not been crystallographically confirmed is a common resource misallocation — and in a three-month hit-to-lead timeline, the weeks spent running and troubleshooting FEP on an unconfirmed binding mode have a direct opportunity cost in SAR cycles not completed.

Force Field and Parameterization Considerations

When FEP is appropriate, force field choice is non-trivial. OPLS4 (Schrödinger) shows improved performance over OPLS3e for charged and heteroaromatic systems; it is currently the best-validated option for drug-like small molecules in the FEP+ framework. GAFF2 (AmberTools) is more widely accessible and adequate for standard organic chemotypes but shows larger errors on sulfones, phosphonates, and boron-containing fragments. For GROMACS-based FEP workflows using free software, combining GAFF2 with TIP3P-FB solvent model and AMBER14SB protein force field is a reasonable starting point with published benchmark validation.

Water placement at the binding site is underappreciated as a FEP accuracy factor. Displaced waters contribute significantly to absolute and relative binding free energies; WaterMap-style analysis or GRAND canonical Monte Carlo sampling of binding site waters before FEP setup meaningfully improves predictions for systems with ordered water networks (serine protease subsites, kinase DFG loop regions, GPCR transmembrane binding sites). Running FEP without accounting for structural waters in the pocket is a systematic error that will propagate into the ΔΔG predictions.

For programs in early hit-to-lead, our practice is to run MM-GBSA rescoring as a bridge: faster than FEP, more physically informed than raw docking scores, and useful for eliminating the bottom half of a ranked shortlist before committing synthesis resources to the full set. See the methodology details on the Science page, and the discussion of how these choices are built into our Screening Platform.

Related reading: Benchmarking Virtual Screening: What Enrichment Rates Actually Tell You covers how to set realistic expectations for docking-based enrichment before a campaign starts.