Glide Docking and Scoring: SP, XP, and When Each Mode Matters


Most computational chemists working in early drug discovery have a rough sense that Glide SP is faster and XP is more accurate. In our experience running structure-based campaigns against a range of targets, that framing holds up, but it leaves out the details that actually matter when you're deciding how to allocate compute and which score to trust when the two modes disagree. This post walks through how Glide's three scoring modes work, what GlideScore is actually measuring, and where each mode tends to win or lose in practice.

Three Modes, One Engine

Glide ships three docking configurations: High-Throughput Virtual Screening (HTVS), Standard Precision (SP), and Extra Precision (XP). They share the same underlying docking engine but differ substantially in how thoroughly they sample conformational space and how strict the scoring function's penalties are.

HTVS is the library-reduction stage. It runs a single, coarse docking pass per compound, trading accuracy for throughput. On a modern GPU node you can screen one million SMILES in a few hours with HTVS. The false-negative rate is meaningful: typically 15 to 25% of actual binders get filtered out here, which is why HTVS output should never go directly to a synthesis decision.

SP sits in the middle. It explores multiple ring conformations and flexible side-chain orientations, applies a refined GlideScore function, and returns reliable binding poses for the top scorers. Against a well-prepared experimental crystal structure, SP enrichment factors at 1% of the screened library commonly fall in the range of 15 to 40-fold over random selection, depending on target class. For kinases and GPCRs with well-defined binding sites, we tend to see the higher end of that range; for targets with shallow or highly flexible binding pockets, SP enrichment drops sharply.

XP is where you commit compute. It samples more thoroughly, applies a stricter desolvation model, and explicitly penalizes poses that bury hydrophilic groups without making compensating hydrogen bonds. XP runs roughly 4 to 8 times slower than SP per compound, so running it on a full 5-million-compound library is rarely practical. The typical workflow is to pass SP's top 5% of scorers to XP for re-scoring, then re-rank the combined output.
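
As a back-of-envelope check on that handoff, the XP re-scoring cost scales with the SP pool size, the SP-to-XP slowdown, and how many workers you run in parallel. All throughput numbers below are illustrative assumptions, not Glide benchmarks:

```python
# Back-of-envelope sizing for an SP -> XP handoff.
# Per-compound timings and worker counts are assumed, not measured.

def xp_rescore_hours(sp_pool_size: int,
                     sp_top_fraction: float = 0.05,
                     sp_sec_per_cpd: float = 15.0,
                     xp_slowdown: float = 6.0,
                     workers: int = 128) -> float:
    """Wall-clock hours to XP-rescore the top SP fraction of a pool."""
    n_xp = int(sp_pool_size * sp_top_fraction)
    xp_sec_per_cpd = sp_sec_per_cpd * xp_slowdown  # XP roughly 4-8x slower than SP
    return n_xp * xp_sec_per_cpd / workers / 3600

# Top 5% of a 500k SP pool = 25k compounds to XP.
print(round(xp_rescore_hours(500_000), 1))  # -> 4.9 hours under these assumptions
```

Changing the worker count or per-compound timing shifts the answer linearly, so the function is mostly useful for deciding whether an XP stage fits in an overnight window.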

What GlideScore Actually Measures

GlideScore is an empirically parameterized free-energy surrogate. Its main terms break down roughly as follows:

| Term | What it captures | Notes |
| --- | --- | --- |
| Coulomb + vdW | Electrostatic and steric shape complementarity | Dominant term for most scaffolds |
| Hydrogen bonding | Directional H-bond satisfaction with receptor residues | XP applies stricter angular cutoffs than SP |
| Desolvation penalty | Cost of displacing water from the binding site and burying polar groups | Stricter in XP; the main differentiator between modes |
| Lipophilic contact | Buried hydrophobic surface area | Rewards lipophilic fit in hydrophobic pockets |
| Penalty terms | Exposed unsatisfied H-bond donors, strained torsions | XP-only penalties catch scoring artifacts SP misses |

GlideScore is negative when favorable, so a score of -9.5 is better than -6.0. The absolute values are not binding affinities. In our data, GlideScore correlates meaningfully with pIC50 within a congeneric series, but cross-scaffold correlation is poor: compounds from different scaffolds with identical GlideScores regularly show 100-fold differences in measured potency. Treat it as a ranking tool within a run, not as a quantitative predictor or an assay result.
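
One way to enforce the within-series rule programmatically is to never let compounds from different scaffolds share a ranking. A minimal sketch with invented compound names, series labels, and scores:

```python
# Rank compounds by GlideScore only within a scaffold series.
# All names, series labels, and scores below are made-up illustrative data.
from collections import defaultdict

hits = [
    ("cmpd_01", "quinazoline", -9.5),
    ("cmpd_02", "quinazoline", -8.1),
    ("cmpd_03", "pyrazole",    -9.5),  # same score as cmpd_01, different
    ("cmpd_04", "pyrazole",    -7.9),  # scaffold: not directly comparable
]

by_scaffold = defaultdict(list)
for name, scaffold, score in hits:
    by_scaffold[scaffold].append((score, name))

# More negative GlideScore = better, so plain ascending sort ranks best-first.
for scaffold, members in by_scaffold.items():
    ranked = [name for _, name in sorted(members)]
    print(scaffold, ranked)
```

The point of the grouping is structural: the code makes it impossible to produce a single flat list that mixes scaffolds, which is exactly the comparison GlideScore does not support.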

When SP Is the Right Call

Use SP as your main campaign mode when you need broad coverage of a large and diverse library. A few specific situations where SP is appropriate:

  • Initial campaign against a target with no prior hit matter. You want coverage, not precision.
  • Library is structurally diverse, like a commercial fragment set or a diverse HTS collection with over 500,000 compounds. XP's extra accuracy on this kind of library is not worth the 6-fold compute overhead.
  • Binding site has a clear deep pocket with strong shape selectivity. SP scoring captures the key geometric constraints well for well-defined sites.
  • You have a crystal structure with a known binder, and you're running a re-docking validation first. SP pose accuracy for co-crystallized ligands routinely hits RMSD under 2.0 angstroms against the experimental pose on well-prepared structures.

One pitfall: SP is more permissive about desolvation, so hydrophilic compounds that cannot make compensating hydrogen bonds will score too optimistically under SP. In our tracking across multiple campaigns, roughly 20 to 30% of SP top-100 compounds show this artifact when re-scored by XP. That number is not small; plan for it.

When XP Earns Its Compute Cost

XP becomes important when the biological target has a water-mediated binding mechanism, when you're working with a literature series and want to validate whether your model reproduces the known SAR, or when your downstream synthesis queue is short and a false positive costs more than the extra compute.

XP's stricter desolvation model catches a pattern SP commonly misses: the high-scoring amphiphilic compound that buries a polar head group in a lipophilic pocket without any compensating interaction. These compounds look like hits on SP but fail almost universally in biochemical assays. We've seen this pattern repeatedly with kinase hinge binders that have a pendant carboxylic acid pointing into the ATP-binding hydrophobic subsite.

The practical XP workflow for a 5-million-compound run: HTVS to reduce to roughly 500,000 compounds (90% attrition is normal), SP on those 500,000 to identify the top 25,000 scorers, then XP re-scoring on the top 25,000 for final short-list generation. The SP-to-XP transition adds around four to six hours of GPU compute, which is a reasonable cost for the false-positive reduction it delivers.
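
The cascade above can be written out as explicit attrition math. The stage fractions mirror the workflow described here; no timing claims are encoded:

```python
# The HTVS -> SP -> XP cascade for a 5-million-compound run,
# expressed as survival fractions per stage.

stages = [
    ("HTVS", 0.10),  # keep ~10% of 5M -> ~500k (90% attrition)
    ("SP",   0.05),  # keep top 5% of 500k -> 25k scorers
    ("XP",   1.00),  # re-score all 25k; the cut happens at short-listing
]

n = 5_000_000
for name, keep in stages:
    n = int(n * keep)
    print(f"{name}: {n:>9,} compounds survive")
```

Writing the funnel down this way makes it easy to sanity-check that a proposed stage fraction actually leaves a pool small enough for the next mode's compute budget.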

Typical Enrichment Factors: Realistic Expectations

Enrichment factor calculations are often presented optimistically in methods papers, because the benchmark sets skew toward well-validated targets with deep, rigid pockets. In our experience with real campaign data, here is what we observe at 1% of the screened library:

  • SP on kinase targets with co-crystal structure: EF typically 20 to 45-fold over random
  • SP on nuclear receptors (e.g., PPAR, RXR): EF typically 10 to 25-fold
  • SP on challenging targets (allosteric sites, flexible loops, shallow pockets): EF 3 to 8-fold, sometimes indistinguishable from random
  • XP re-scoring on top SP hits, same kinase targets: EF gains of 1.5 to 2.5x over SP alone
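
For reference, the enrichment factor at a screened fraction is just the actives recovered divided by the actives expected at random in that fraction. A minimal sketch with toy numbers (not campaign data):

```python
# EF(f) = actives found in the top fraction f, divided by the actives
# expected at random in a sample of the same size.

def enrichment_factor(n_library: int, n_actives: int,
                      n_screened: int, n_actives_found: int) -> float:
    expected_at_random = n_actives * n_screened / n_library
    return n_actives_found / expected_at_random

# Toy example: 1,000 actives hidden in a 100,000-compound library.
# Screening the top 1% (1,000 compounds) recovers 200 of them.
# Random selection would have found ~10, so EF = 200 / 10 = 20-fold.
print(enrichment_factor(100_000, 1_000, 1_000, 200))  # -> 20.0
```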

The enrichment factor framing is worth pausing on. At a typical 1% baseline hit rate, a 20-fold enrichment at 1% of the library still means roughly 80% of your short-list will not confirm in assay. The pipeline's job is to reduce synthesis volume, not to guarantee hit confirmation. Setting that expectation clearly with your CRO partner or internal wet-lab team before results are delivered prevents a lot of the friction we see during hit-identification handoffs.

Practical Workflow Pitfalls

A few patterns that cause problems repeatedly:

Skipping receptor preparation. GlideScore is extremely sensitive to protonation state assignments and water placement. Running Glide on a raw PDB file without running the Protein Preparation Wizard or an equivalent protocol is the single most common error we see in campaigns from teams new to the platform. An unprepared receptor can shift docking scores by 2 to 4 units across the board, making the entire ranking unreliable.

Using GlideScore as a cross-scaffold ranking tool. Already covered, but it bears repeating because it causes downstream synthesis failures. When you have two distinct chemotypes both sitting at GlideScore -8.5, do not assume they are equivalent hits. Apply your ML re-ranker, inspect the poses manually, and let ADMET flags break the tie.

Overloading XP with a full-diversity library. Running XP directly on a 2-million-compound corporate collection is possible but almost never the right allocation of compute. The HTVS-SP-XP cascade exists for a reason. The attrition at each stage is normal and expected.

Ignoring strain energy in the final short-list. GlideScore does not fully account for ligand strain energy in all pose configurations. Compounds with unusual bond angles or locked ring systems can score well geometrically while adopting conformations that are physically implausible in solution. Running a ligand strain energy post-filter on the final short-list eliminates this category of artifact before compounds go to synthesis feasibility review.
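
A strain post-filter can be as simple as a threshold on a per-compound strain energy. The cutoff value, field names, and records below are all hypothetical assumptions; the strain energies themselves would come from a separate conformational-analysis step, not from GlideScore:

```python
# Hypothetical strain-energy post-filter on a final short-list.
# The 4 kcal/mol cutoff is an assumed, tunable threshold, and the
# records are illustrative: strain values come from an external
# conformational-analysis tool, not from the docking run itself.

STRAIN_CUTOFF_KCAL = 4.0  # assumption; tune per program

shortlist = [
    {"name": "cmpd_17", "glide_score": -9.8, "strain_kcal": 1.2},
    {"name": "cmpd_42", "glide_score": -9.6, "strain_kcal": 7.5},  # implausible conformation
]

passed = [c for c in shortlist if c["strain_kcal"] <= STRAIN_CUTOFF_KCAL]
print([c["name"] for c in passed])
```

The value of running this as an explicit filter, rather than eyeballing poses, is that the rejected compounds are logged with a reason before the synthesis feasibility review.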

Practical note: run a re-docking validation on any known actives you have before interpreting campaign scores. If your protocol cannot recover the co-crystal pose of a literature binder within 2.0 angstrom RMSD, something in your receptor preparation is off. Fix it before screening the full library.
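
The validation gate itself can be sketched as a plain RMSD check. The coordinates below are made up, and this naive version skips the alignment and symmetry correction a real re-docking comparison would need (docked and crystal poses must already share a frame of reference):

```python
# Naive heavy-atom RMSD between a docked pose and the crystal pose.
# Coordinates are illustrative; real pipelines use symmetry-corrected,
# in-place RMSD on matched heavy atoms.
import math

def rmsd(coords_a, coords_b):
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked  = [(0.2, 0.1, 0.0), (1.6, 0.2, 0.1), (1.4, 1.7, 0.2)]

r = rmsd(crystal, docked)
print(f"redocking RMSD = {r:.2f} A, pass = {r < 2.0}")
```

If the gate fails on a known binder, treat it as a receptor-preparation or grid-definition problem, not as a reason to loosen the 2.0 angstrom threshold.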

Putting It Together for a Seed Biotech Campaign

For a seed-stage biotech running its first structure-based campaign, the most common mistake is treating Glide as a black box that returns a ranked list. It is not. The list is only as reliable as the receptor preparation, the grid definition, and the score interpretation framework applied downstream. The mode choice, HTVS versus SP versus XP, is a secondary decision.

Start with a co-crystallized reference compound for pose validation. Set the RMSD threshold at 2.0 angstroms before you believe any SP or XP scores. If you are working from an AlphaFold2 model rather than a crystal structure, expect lower enrichment factors and consider running a more aggressive ADMET pre-filter to compensate for the higher rate of false positives from less certain binding-site geometry.

The cascade workflow, HTVS at 5 million, SP on 10%, XP re-score on the top 0.5%, ML re-ranking on the top 50 to 200, delivers a short-list you can defend at a project team meeting. It also produces the kind of structured docking data, complete with score distributions, pose images, and attrition statistics, that communicates clearly to wet-lab colleagues who did not sit through the screening run.

We have found that the biggest value of a well-parameterized Glide campaign is not the top-ranked compound. It is the annotated short-list with enough context for the medicinal chemist to make a confident synthesis decision. That is the goal the protocol should be designed around.

Moleculepath runs Glide SP and XP campaigns as part of its multi-stage virtual screening pipeline. If you want to see how the cascade applies to your target, contact us to discuss your program.