iotbx.pdb tutorial: pdb_truncate_to_ala
¶
This tutorial was written by Ralf Grosse-Kunstleve for the 2008
SBGrid meeting. It is designed for novices to
CCTBX (or programming in general), but gives a good practical overview of how
the iotbx.pdb
facilities may be used.
Overview¶
The pdb_truncate_to_ala
tutorial shows how to use the iotbx.pdb
module to truncate the amino-acid residues in a PDB file to alanine
(e.g. in preparation for molecular replacement). The tutorial consists
of five scripts with increasing sophistication:
- v0_getting_started.py
Reads a PDB file and shows some information about the content.
- v1_loop_over_atoms.py
Also loops over the atoms in the PDB file and prints residue information and atom names.
- v2_simple.py
Simple version removing amino-acid side chain atoms, except C-beta.
- v3_better.py
Improved removal of side chain atoms that works even if there are alternative conformations with a mix of residue names.
- v4_with_bells_and_whistles.py
Additional bookkeeping for a few informative print statements.
The full source listing for these scripts is shown at the end of this document.
Use print and help()!¶
Before you study any of the scripts below, install the CCTBX.
This will enable you to run the scripts. Use the
PDB files in the `Tutorial directory`_ as inputs, or any other PDB file
you may have. While analyzing a script, insert print
statements
and run the script to find out more about the objects. It may also
be useful to insert help(obj)
to see the attributes and methods of
obj
, where obj
can be any of the objects created in the script.
If you don’t know what an object is: thing is a pretty good approximation.
v0_getting_started.py
¶
The iotbx.pdb
module implements highly efficient procedures for:
Processing the records (i.e. lines) in a PDB file.
Constructing a hierarchy object. This is a five-deep nested data structure:
model chain residue_group atom_group atom
v0_getting_started.py
is a complete Python script for executing
these steps:
import iotbx.pdb
import sys
def run(args):
if (len(args) == 0):
raise RuntimeError("Please specify one or more pdb file names.")
for file_name in args:
pdb_obj = iotbx.pdb.input(file_name=file_name)
pdb_obj.construct_hierarchy().overall_counts().show()
if (__name__ == "__main__"):
run(sys.argv[1:])
Most of this script is so called boilerplate code, i.e. code that in
some form or shape is found in most Python scripts. At the beginning of
the script are import
statements. These import the modules needed
for a task. The bottom lines represent best practice. They enable
the script to be imported and used from other Python scripts. The
first two lines of the run()
function are a minimalistic - but
often sufficient - way to give users a hint how to use the script. It
works both for someone reading the source code of the script, and a
user running the script without arguments. For example:
% iotbx.python v0_getting_started.py
Traceback (most recent call last):
File "v0_getting_started.py", line 12, in <module>
run(sys.argv[1:])
File "v0_getting_started.py", line 6, in run
raise RuntimeError("Please specify one or more pdb file names.")
RuntimeError: Please specify one or more pdb file names.
In addition to showing the error message, Python is so friendly show us exactly where the error originates. This is often extremely helpful.
The meat of the script is in these two lines:
pdb_obj = iotbx.pdb.input(file_name=file_name)
pdb_obj.construct_hierarchy().overall_counts().show()
The first line executes the two steps outline above. This is really all we need, but the second line produces output that is useful to give the user a quick overview of what is in the PDB file:
% iotbx.python v0_getting_started.py crambin_pieces.pdb
total number of:
models: 1
chains: 2
alt. conf.: 4
residues: 3
atoms: 23
anisou: 0
number of atom element+charge types: 3
histogram of atom element+charge frequency:
" C " 16
" O " 5
" N " 2
residue name classes:
"common_amino_acid" 2
"other" 1
number of chain ids: 2
histogram of chain id frequency:
" " 1
"A" 1
number of alt. conf. ids: 2
histogram of alt. conf. id frequency:
"A" 2
"B" 2
residue alt. conf. situations:
pure main conf.: 1
pure alt. conf.: 1
proper alt. conf.: 1
improper alt. conf.: 0
chains with mix of proper and improper alt. conf.: 0
number of residue names: 3
histogram of residue name frequency:
"EOH" 1 other
"ILE" 1
"SER" 1
v1_loop_over_atoms.py
¶
The model
, chain
, and atom
levels of the hierarchy
object are probably immediately obvious to someone familiar with
the content of PDB files. The residue_group
and atom_group
levels are more complex. This complexity is related to alternative
conformations. If there are no alternative conformations in
a PDB file, all residue groups contain exactly one atom group,
which contains all the atoms of a residue. A file with alternative
conformations will lead to residue groups with multiple atom groups,
one for each conformer. The crambin_pieces.pdb file used above
is a file with alternative conformations (about 24% of the files
in the PDB database contain alternative conformations).
To truncate amino-acid residues to alanine, we need to know which residues are amino-acids, and the atom names. A more detailed presentation of the hierarchy object shows where we can find this information:
model
id
chain(s)
id
residue_group(s)
resid
atom_group(s)
altloc, resname
atom(s)
name
segid
element
charge
serial
xyz sigxyz
occ sigocc
b sigb
uij siguij
hetero
We don’t need all this information for the truncation procedure, just this subset of the hierarchy:
model
chain(s)
residue_group(s)
resid
atom_group(s)
altloc, resname
atom(s)
name
This presentation translates directly into Python code, as found in v1_loop_over_atoms.py:
from __future__ import division
import iotbx.pdb
import sys
def run(args):
if (len(args) == 0):
raise RuntimeError("Please specify one or more pdb file names.")
for file_name in args:
pdb_obj = iotbx.pdb.input(file_name=file_name)
hierarchy = pdb_obj.construct_hierarchy()
hierarchy.overall_counts().show()
for model in hierarchy.models():
for chain in model.chains():
for rg in chain.residue_groups():
print 'resid: "%s"' % rg.resid()
for ag in rg.atom_groups():
print ' altloc: "%s", resname: "%s"' % (ag.altloc, ag.resname)
for atom in ag.atoms():
print ' ', atom.name
if (__name__ == "__main__"):
run(sys.argv[1:])
The script contains interleaved print statements. The output of:
% iotbx.python v1_loop_over_atoms.py crambin_pieces.pdb
can be found here: http://cctbx.sourceforge.net/sbgrid2008/v1_loop_over_atoms_crambin_pieces.out
v2_simple.py
¶
We have residue names and atom names now, but we still need the information to decide what residues are amino acids, and what atom names we want to keep.
The iotbx.pdb
module contains a sub-module amino_acid_codes
.
This sub-module contains two Python dictionaries, one of which is
(shortened):
one_letter_given_three_letter = {
"ALA": "A",
"ARG": "R",
...
"TYR": "Y",
"VAL": "V"}
We don’t need the one-letter codes, but we can re-use the keys of this dictionary to efficently decide if a residue name corresponds to an amino acid. The relevant lines in v2_simple.py are:
import iotbx.pdb.amino_acid_codes
...
aa_resnames = iotbx.pdb.amino_acid_codes.one_letter_given_three_letter
...
if (ag.resname in aa_resnames):
For the atom names, we use a Python set
. The relevant lines
in v2_simple.py are:
ala_atom_names = set([" N ", " CA ", " C ", " O ", " CB "])
...
if (atom.name not in ala_atom_names):
We use a Python set
because it uses hashing techniques for element
lookup when processing the in
in the if
statement. For a
small list like here it doesn’t really matter, but in Python it is
so easy to use advanced hashing techniques, simply by converting the
list
of atom names to a set
, there is no reason not to take
advantage of them.
Now that we know which residues we want to truncate, and which atom names we want to keep, we just need one more line to remove the side chain atoms:
ag.remove_atom(atom=atom)
This removes the atom from the atom group. The only thing left to do once the nested loops over the hierarchy are finished, is to write the modified hierarchy to a file:
output_pdb = "v2_truncated_to_ala_"+file_name
pdb_obj.hierarchy.write_pdb_file(file_name=output_pdb)
The first line builds the output file name by concatenating two
strings with the +
operator. In the second line the
.write_pdb_file()
method of the hierarchy object is used
to write the file to disk.
The complete source for this script:
from __future__ import division
import iotbx.pdb
import iotbx.pdb.amino_acid_codes
import sys
def run(args):
if (len(args) == 0):
raise RuntimeError("Please specify one or more pdb file names.")
aa_resnames = iotbx.pdb.amino_acid_codes.one_letter_given_three_letter
ala_atom_names = set([" N ", " CA ", " C ", " O ", " CB "])
for file_name in args:
pdb_obj = iotbx.pdb.input(file_name=file_name)
hierarchy = pdb_obj.construct_hierarchy()
hierarchy.overall_counts().show()
for model in hierarchy.models():
for chain in model.chains():
for rg in chain.residue_groups():
for ag in rg.atom_groups():
if (ag.resname in aa_resnames):
for atom in ag.atoms():
if (atom.name not in ala_atom_names):
ag.remove_atom(atom=atom)
output_pdb = "v2_truncated_to_ala_"+file_name
hierarchy.write_pdb_file(file_name=output_pdb)
if (__name__ == "__main__"):
run(sys.argv[1:])
The output PDB file of:
% iotbx.python v2_simple.py crambin_pieces.pdb
can be found here: http://cctbx.sourceforge.net/sbgrid2008/v2_truncated_to_ala_crambin_pieces.pdb
v3_better.py
¶
For most practical purposes, the v2_simple.py
script is completely
sufficient. However, there are currently 16 files in the PDB (of 50623
total, as of Apr 30, 2008) for which this is not true. One example is
the structure with the PDB ID 1ysl
. The file resname_mix.pdb
contains the problematic residue:
HETATM 3907 N ACSD B 111 25.006 36.731 16.222 0.50 18.83 N
HETATM 3908 CA ACSD B 111 25.536 35.903 15.152 0.50 19.90 C
HETATM 3909 CB ACSD B 111 25.931 36.658 13.876 0.50 21.09 C
HETATM 3910 SG ACSD B 111 25.414 38.295 13.671 0.50 26.29 S
HETATM 3911 C ACSD B 111 26.713 35.054 15.562 0.50 19.23 C
HETATM 3912 O ACSD B 111 27.472 34.533 14.697 0.50 20.10 O
HETATM 3913 OD1ACSD B 111 23.793 38.008 13.181 0.50 30.17 O
HETATM 3914 OD2ACSD B 111 25.111 39.102 15.048 0.50 26.06 O
ATOM 3915 N BCYS B 111 24.996 36.697 16.246 0.50 13.39 N
ATOM 3916 CA BCYS B 111 25.522 35.913 15.123 0.50 16.53 C
ATOM 3917 C BCYS B 111 26.790 35.104 15.498 0.50 15.20 C
ATOM 3918 O BCYS B 111 27.342 34.391 14.660 0.50 16.26 O
ATOM 3919 CB BCYS B 111 25.840 36.879 13.947 0.50 20.05 C
ATOM 3920 SG BCYS B 111 24.645 38.257 14.039 0.50 29.86 S
Rare cases like this are the very reason why we need the
residue_group
and atom_group
levels in the hierarchy. Here
we have two different residue names for the same member of a
chain. Even though this sitution is rare (there are only 37 additional
non-amino-acid instances in the PDB), they are entirely plausible and
valid, and a comprehensive PDB processing library has to be able to
handle them.
The v2_simple.py
script will only truncate the CYS
residue
above:
% iotbx.python v2_simple.py resname_mix.pdb
Output: http://cctbx.sourceforge.net/sbgrid2008/v2_truncated_to_ala_resname_mix.pdb
It would be better if it also truncated the non-standard CSD
residue in the A
alternative conformation. Let’s find out what
it takes to achieve this.
The basic idea is to check if there is at least one amino acid in a residue group, and if so, apply the truncation to all residues in the group, even if they don’t have a standard residue name. This means, for each residue group we have to loop over the atom groups twice, first to scan for at least one standard amino-acid residue name, and if there is one, a second time to do the truncation. The bad news is, to achieve this, we have to double our effort. The good news is, the extra effort is only five lines.
This is the part of v2_simple.py
we have to work on:
for ag in rg.atom_groups():
if (ag.resname in aa_resnames):
for atom in ag.atoms():
if (atom.name not in ala_atom_names):
ag.remove_atom(atom=atom)
The extra effort goes into finding out if there is at least one amino acid in the residue group:
def have_amino_acid():
for ag in rg.atom_groups():
if (ag.resname in aa_resnames):
return True
return False
Once that’s settled, we can move the
if (ag.resname in aa_resnames)
test outside the loop
over the atom groups and replace it with if (have_amino_acid())
.
The rest of the v2 code is unchanged:
if (have_amino_acid()):
for ag in rg.atom_groups():
for atom in ag.atoms():
if (atom.name not in ala_atom_names):
ag.remove_atom(atom=atom)
Note that have_amino_acid()
is a nested function. Nested functions
are often very useful to centralize small sub-tasks without taking
them completely out of context. Since our nested function is aware
of the context, we don’t need to pass any arguments.
The complete source for this script:
from __future__ import division
import iotbx.pdb
import iotbx.pdb.amino_acid_codes
import sys
def run(args):
if (len(args) == 0):
raise RuntimeError("Please specify one or more pdb file names.")
aa_resnames = iotbx.pdb.amino_acid_codes.one_letter_given_three_letter
ala_atom_names = set([" N ", " CA ", " C ", " O ", " CB "])
for file_name in args:
pdb_obj = iotbx.pdb.input(file_name=file_name)
hierarchy = pdb_obj.construct_hierarchy()
hierarchy.overall_counts().show()
for model in hierarchy.models():
for chain in model.chains():
for rg in chain.residue_groups():
def have_amino_acid():
for ag in rg.atom_groups():
if (ag.resname in aa_resnames):
return True
return False
if (have_amino_acid()):
for ag in rg.atom_groups():
for atom in ag.atoms():
if (atom.name not in ala_atom_names):
ag.remove_atom(atom=atom)
output_pdb = "v3_truncated_to_ala_"+file_name
hierarchy.write_pdb_file(file_name=output_pdb)
if (__name__ == "__main__"):
run(sys.argv[1:])
The output PDB file of:
% iotbx.python v3_better.py resname_mix.pdb
can be found here: http://cctbx.sourceforge.net/sbgrid2008/v3_truncated_to_ala_resname_mix.pdb
v4_with_bells_and_whistles.py
¶
The v3_better.py
script does a comprehensive job, but it
doesn’t tell the user anything about the manipulations. It would
be interesting to know how many atoms were deleted, how many
residue are affected, and how many residue are unchanged. The
v4_with_bells_and_whistles.py
script produces this information.
To get the desired counts, we need counters, and we need to initialize them before we enter the nested loops over the hierarchy:
n_amino_acid_residues = 0
n_other_residues = 0
n_atoms_removed = 0
Here n
is a shorthand for number of.
Inside the loop over the residue groups, we keep track of the
amino-acid counts:
if (not have_amino_acid()):
n_other_residues += 1
else:
n_amino_acid_residues += 1
And instead of just removing the atoms, we remove and count:
ag.remove_atom(atom=atom)
n_atoms_removed += 1
When the loops over the hierarchy are finished, we print the counts:
print "Number of amino acid residues:", n_amino_acid_residues
print "Number of other residues:", n_other_residues
print "Number of atoms removed:", n_atoms_removed
Since we can now easily find out if no atoms were removed (e.g. because someone passed in a DNA model), we should take advantage of it and write the output PDB file only if there are changes:
if (n_atoms_removed != 0):
output_pdb = "v4_truncated_to_ala_"+os.path.basename(file_name)
if (output_pdb.endswith(".gz")): output_pdb = output_pdb[:-3]
print "Writing file:", output_pdb
pdb_obj.hierarchy.write_pdb_file(
file_name=output_pdb,
crystal_symmetry=pdb_obj.input.crystal_symmetry(),
append_end=True)
There are three more small enhancements compared to the
v3_better.py
script:
os.path.basename()
is used to remove any directory name component from the input file name, if present. With this, it is certain that the output file is written in the current working directory, not the directory of the input file (which may be in another user’s directory or a system directory)
iotbx.pdb.input
is able to open .gz files directly (e.g. compressed files as downloaded from the PDB). However, the.write_pdb_file()
method always writes plain (non-compressed) files. Therefore the.gz
extension has to be removed, if present. The string method.endswith()
is used to detect the extension, and string slicing ([:-3]
) to remove it.The input crystal symmetry information (unit cell and space group) is passed to the
.write_pdb_file()
method, which then writes CRYST1 and SCALE records to the output file. In addtion, the optionalappend_end
argument is used to request that an END record is written at the end of the output file.
Full source:
from __future__ import division
import iotbx.pdb
import iotbx.pdb.amino_acid_codes
import sys, os
def run(args):
if (len(args) == 0):
raise RuntimeError("Please specify one or more pdb file names.")
aa_resnames = iotbx.pdb.amino_acid_codes.one_letter_given_three_letter
ala_atom_names = set([" N ", " CA ", " C ", " O ", " CB "])
for file_name in args:
pdb_obj = iotbx.pdb.input(file_name=file_name)
hierarchy = pdb_obj.construct_hierarchy()
hierarchy.overall_counts().show()
n_amino_acid_residues = 0
n_other_residues = 0
n_atoms_removed = 0
for model in hierarchy.models():
for chain in model.chains():
for rg in chain.residue_groups():
def have_amino_acid():
for ag in rg.atom_groups():
if (ag.resname in aa_resnames):
return True
return False
if (not have_amino_acid()):
n_other_residues += 1
else:
n_amino_acid_residues += 1
for ag in rg.atom_groups():
for atom in ag.atoms():
if (atom.name not in ala_atom_names):
ag.remove_atom(atom=atom)
n_atoms_removed += 1
print "Number of amino acid residues:", n_amino_acid_residues
print "Number of other residues:", n_other_residues
print "Number of atoms removed:", n_atoms_removed
if (n_atoms_removed != 0):
output_pdb = "v4_truncated_to_ala_"+os.path.basename(file_name)
if (output_pdb.endswith(".gz")): output_pdb = output_pdb[:-3]
print "Writing file:", output_pdb
hierarchy.write_pdb_file(
file_name=output_pdb,
crystal_symmetry=pdb_obj.crystal_symmetry(),
append_end=True)
print
if (__name__ == "__main__"):
run(sys.argv[1:])