xfel.clustering package¶
xfel.clustering.cluster module¶
This module is designed to provide tools to deal with groups of serial crystallography images.
The class Cluster allows the creation, storage and manipulation of these sets of frames. Methods exist to create sub-clusters (new cluster objects) or to act on an existing cluster, e.g. to plot the unit cell distributions.
Author: Oliver Zeldin <zeldin@stanford.edu>
cluster.Cluster¶
- class xfel.clustering.cluster.Cluster(data, cname='cluster', info='')¶
Bases:
object
Class for operation on groups of single XFEL images (here described by SingleFrame objects) as cluster objects.
Objects can be created directily using the __init__ method, or by using a classmethod e.g. to create an object from a folder full of integration pickles. Whenever a method can plot, there is the option of passing it an appropriate number of matplotlib axes objects, which will then get returned for use in composite plots. See cluster.42 for an example. If no axes are passed, the methods will just plot the result to the screen. Clustering filters can act on these to break them up into cluster objects with different members. A ‘filter’ is just a clustering procedure that puts the passes and fails into different clusters. This is acheived through the make_sub_cluster() method. This also keeps track of a sub-clusters heritage through the .info string, which is appended to. The idea is to be able to write filter scripts for each data. e.g
>>> test_cluster = Cluster.from_directories(["~/test_data"], ... 'test_script') >>> P3_only = test_cluster.point_group_filer('P3') >>> sub_clusters = P3_only.ab_cluster(1200) >>> big_cluster = max(sub_clusters, key=lambda x: len(x.members)) >>> best_data = big_cluster.total_intensity_filter(res=6.5, ... completeness_threshold=0.1, ... plot=False) >>> print best_data.info
Subsequent postrefinenment/merging programs can be called on an output cluster.lst file:
>>> prime.postrefine params.phil $(cat cluster.lst)
- __dict__ = mappingproxy({'__module__': 'xfel.clustering.cluster', '__doc__': 'Class for operation on groups of single XFEL images (here described by\n SingleFrame objects) as cluster objects.\n\n Objects can be created directily using the __init__ method, or by using a\n classmethod e.g. to create an object from a folder full of integration\n pickles. Whenever a method can plot, there is the option of\n passing it an appropriate number of matplotlib axes objects, which will then\n get returned for use in composite plots. See cluster.42 for an example.\n If no axes are passed, the methods will just plot the result to the screen.\n Clustering filters can act on these to break them up into cluster objects with\n different members. A \'filter\' is just a clustering procedure that puts the\n passes and fails into different clusters. This is acheived through the\n make_sub_cluster() method. This also keeps track of a sub-clusters heritage\n through the .info string, which is appended to. The idea is to be able to\n write filter scripts for each data.\n e.g ::\n\n >>> test_cluster = Cluster.from_directories(["~/test_data"],\n ... \'test_script\')\n >>> P3_only = test_cluster.point_group_filer(\'P3\')\n >>> sub_clusters = P3_only.ab_cluster(1200)\n >>> big_cluster = max(sub_clusters, key=lambda x: len(x.members))\n >>> best_data = big_cluster.total_intensity_filter(res=6.5,\n ... completeness_threshold=0.1,\n ... plot=False)\n >>> print best_data.info\n\n Subsequent postrefinenment/merging programs can be called on an output\n cluster.lst file::\n >>> prime.postrefine params.phil $(cat cluster.lst)\n ', '__init__': <function Cluster.__init__>, 'from_directories': <classmethod(<function Cluster.from_directories>)>, 'from_crystal_symmetries': <classmethod(<function Cluster.from_crystal_symmetries>)>, 'from_list': <classmethod(<function Cluster.from_list>)>, 'from_iterable': <classmethod(<function Cluster.from_iterable>)>, 'from_files': <classmethod(<function Cluster.from_files>)>, 'from_expts': <classmethod(<function Cluster.from_expts>)>, 'make_sub_cluster': <function Cluster.make_sub_cluster>, 'best_by_CC': <function Cluster.best_by_CC>, 'print_ucs': <function Cluster.print_ucs>, 'point_group_filter': <function Cluster.point_group_filter>, 'total_intensity_filter': <function Cluster.total_intensity_filter>, 'ab_cluster': <function Cluster.ab_cluster>, 'dump_file_list': <function Cluster.dump_file_list>, 'visualise_orientational_distribution': <function Cluster.visualise_orientational_distribution>, 'intensity_statistics': <function Cluster.intensity_statistics>, 'all_frames_intensity_stats': <function Cluster.all_frames_intensity_stats>, 'merge_dict': <function Cluster.merge_dict>, '__len__': <function Cluster.__len__>, 'dump_as_mtz': <function Cluster.dump_as_mtz>, 'to_pandas': <function Cluster.to_pandas>, 'uc_feature_vector': <function Cluster.uc_feature_vector>, 'prime_postrefine': <function Cluster.prime_postrefine>, '__dict__': <attribute '__dict__' of 'Cluster' objects>, '__weakref__': <attribute '__weakref__' of 'Cluster' objects>, '__annotations__': {}})¶
- __init__(data, cname='cluster', info='')¶
Builds a cluster from a list of SingleFrame objects, as well as information about these as a cluster (e.g. mean unit cell).
- Parameters:
data – a list of SingleFrame objects
cname – the name of the cluster, as a string.
info – an info-string for the cluster.
- __len__()¶
Number of images in the cluster
- __module__ = 'xfel.clustering.cluster'¶
- __weakref__¶
list of weak references to the object
- ab_cluster(threshold=10000, method='distance', linkage_method='single', log=False, ax=None, write_file_lists=True, schnell=False, doplot=True, labels='default')¶
Hierarchical clustering using the unit cell dimentions.
- Parameters:
threshold – the threshold to use for prunning the tree into clusters.
method – which clustering method from scipy to use when creating the tree (see scipy.cluster.hierarchy)
linkage_method – which linkage method from scipy to use when creating the linkages. x (see scipy.cluster.hierarchy)
log – if True, use log scale on y axis.
ax – if a matplotlib axes object is provided, plot to this. Otherwise, create a new axes object and display on screen.
write_file_lists – if True, write out the files that make up each cluster.
schnell – if True, use simple euclidian distance, otherwise, use Andrews-Berstein distance from Andrews & Bernstein J Appl Cryst 47:346 (2014) on the Niggli cells.
doplot – Boolean flag for if the plotting should be done at all.
Runs faster if switched off. :param labels: ‘default’ will not display any labels for more than 100 images, but will display file names for fewer. This can be manually overidden with a boolean flag. :return: A list of Clusters ordered by largest Cluster to smallest
Note
Use ‘schnell’ option with caution, since it can cause strange behaviour around symmetry boundaries.
- all_frames_intensity_stats(ax=None, smoothing_width=2000)¶
Goes through all frames in the cluster, and plots all the partial intensites. Then does a linear fit and rolling average on these.
- Parameters:
smoothing_width – the width of the smoothing window.
ax – Optional matplotlib axes object to plot to. Otherwise, plot to screen.
- Returns:
the axis, with the data plotted onto it.
- best_by_CC(other, assert_is_similar_symmetry=False)¶
Return the SingleFrame object with the highest CC to a reference miller array. :param other: miller array object to be correlated against :return: a SingleFrame object.
- dump_as_mtz(mtz_name, use_fullies=False)¶
Merge using weighted mean and standard deviation if all miller arrays
- dump_file_list(out_file_name=None)¶
Dumps a list of paths to inegration pickle files to a file. One line per image. Provides easy input into post-refinement programs.
- Parameters:
out_file_name – the output file name.
- classmethod from_crystal_symmetries(crystal_symmetries, lattice_ids=None, _prefix='cluster_from_crystal_symmetries', _message='Made from list of individual cells', n_images=None, dials=False, **kwargs)¶
Constructor to get a cluster from a list of crystal symmetries.
- classmethod from_directories(path_to_integration_dir, _prefix='cluster_from_dir', n_images=None, dials=False, **kwargs)¶
Constructor to get a cluster from pickle files, from the recursively walked paths. Can take more than one argument for multiple folders. usage: Cluster.from_directories(..) :param path_to_integration_dir: list of directories containing pickle files. Will be searched recursively. :param n_images: find at most this number of images. :param use_b: Boolean. If True, intialise Scale and B. If False, use only mean intensity scalling.
- classmethod from_expts(refl_table=None, expts_list=None, _prefix='cluster_from_file', _message='Made from experiment objects', n_images=None, **kwargs)¶
Constructor to get a cluster from experiment and reflection list objects :param refl_table: DIALS integrated reflection table :param expts_list: DIALS experiment list :param n_images: find at most this number of images
- classmethod from_files(raw_input=None, pickle_list=[], dials_refls=[], dials_expts=[], _prefix='cluster_from_file', _message='Made from list of individual files', n_images=None, dials=False, json=False, **kwargs)¶
Constructor to get a cluster from a list of individual files. :param pickle_list: list of pickle files :param dials_refls: list of DIALS integrated reflections :param dials_expts: list of DIALS experiment jsons :param n_images: find at most this number of images :param dials: use the dials_refls and dials_expts arguments to construct the clusters (default: False) :param use_b: Boolean. If True, intialise Scale and B. If False, use only mean intensity scalling.
- classmethod from_iterable(iterable, _prefix='cluster_from_iterable', _message='Made from list of individual cells', **kwargs)¶
Constructor to get a cluster from an iterable (a list or tuple). The file must list unit cell a,b,c,alpha,beta,gamma and space_group_type, each as a single token. :param iterable: a list or a tuple
- classmethod from_list(file_name, raw_input=None, pickle_list=[], dials_refls=[], dials_expts=[], _prefix='cluster_from_file', _message='Made from list of individual cells', n_images=None, dials=False, **kwargs)¶
Constructor to get a cluster from a single file. The file must list unit cell a,b,c,alpha,beta,gamma and space_group_type, each as a single token. :param file_name: pathname of the file
- intensity_statistics(ax=None)¶
Uses the per-frame B and G fits (gradient and intercept of the ln(i) vs (sin(theta)/lambda)**2 plot) to create three agregate plots: 1) histogram of standard errors on the per-frame fits 2) histogram of B factors 3) scatter plot of intercept vs. gradient (G vs. B)
- Parameters:
ax – optionally hand the method three matplotlib axes objects to plot onto. If not specified, will plot the data.
- Returns:
the three axes, with the data plotted onto them.
- make_sub_cluster(new_members, new_prefix, new_info)¶
Make a sub-cluster from a list of SingleFrame objects from the old SingleFrame array.
- Parameters:
new_members – a new set of SingleFrame objects, typically a subset of
the current cluster. :param new_prefix: prefix to be passed directly to __init__ :param new_info: new information about this cluster. This is inteligently appended to the old info string so that the history of sub-clusters is tracted.
- merge_dict(use_fullies=False)¶
Make a dict of Miller indices with ([list of intensities], resolution) value tuples for each miller index.
- point_group_filter(point_group)¶
Return all the SingleFrames that have a given pointgroup.
- prime_postrefine(inputfile)¶
Run postrefinement from Prime on a cluster. Implements the prime API, and updates the SingleFrame objects with the attribute miller_fullies. :param cluster: a cluster object. :param inputfile: a Prime .inp file.
- print_ucs()¶
Prints a list of all the unit cells in the cluster to CSV.
- to_pandas()¶
- total_intensity_filter(res='', completeness_threshold=0.95, plot=False)¶
Note
This is still in development, use at own risk!
Creates an optimal sub-cluster using the fewest images that fulfil the criteria defined by: :param res: desired resolution. Defaults to that of the dataset. :param completeness: the desired completeness of the subset :param multiplicity: the desired multiplicity of the subset
- uc_feature_vector()¶
Return a len(cluster) * 6 numpy array of features for use in ML algos
- visualise_orientational_distribution(axes_to_return=None, cbar=True)¶
Creates a plot of the orientational distribution of the unit cells.
- Parameters:
axes_to_return – if None, print to screen, otherwise, requires 3 axes objects, and will return them.
cbar – boolean to specify if a color bar should be used.
xfel.clustering.cluster_groups module¶
Utilitites for dealing with lists of clusters.
- xfel.clustering.cluster_groups.unit_cell_info(sub_clusters)¶
Print unit cell information for a list of clusters.
- Parameters:
sub_clusters – a list of cluster objects
- Returns:
a string containing median unit cells, standard deviations and point group composition of each cluster.
xfel.clustering.singleframe module¶
Module for working with single images in a serial crystallography dataset
singleframe.SingleFrame¶
- class xfel.clustering.singleframe.SingleFrame(path=None, filename=None, crystal_num=0, remove_negative=False, use_b=True, scale=True, dicti=None, pixel_size=None)¶
Bases:
InputFrame
Class that creates single-image agregate metrics/scoring that can then be used in downstream clustering or filtering procedures.
- ANGSTROMS_TO_EV = 12398.425¶
- __init__(path=None, filename=None, crystal_num=0, remove_negative=False, use_b=True, scale=True, dicti=None, pixel_size=None)¶
Constructor for SingleFrame object, using a cctbx.xfel integration pickle.
- Parameters:
path – path to integration pickle
filename – the file name alone (used as a label)
crystal_num – if multiple lattices present, the latice number.
remove_negative – Boolean for removal of negative intensities
use_b – if True, initialise scale and B, if false, use only mean-intensity scaling.
dicti – optional. If a dictionairy is supplied here, will create object from that rather than attempting to read the file specified in path, filename.
pixel_size – the size of pixels in mm. Defaults to a MAR detector with a warning at debug level of logging.
scale – if False, will intialise scales to G=1, B=0.
- Returns:
a SingleFrame object, with the following Object attributes:
- Object attributes are:
is_polarization_corrected: Boolean flag indicatinf if polarization correction has been applied
miller_array: the cctbx.miller miller array of spot intensities.
mapped_predictions: the mapped_predictions locations
path: full path to the original file
name: file-name, used as an identifier
`crystal_system:
pg: point group of pickle
uc: Niggli unit cell as a tuple
orientation: cctbx crystal_orientation object
total_i: the total integrated intensity for this frame
xbeam: x-location of beam centre
ybeam: y-location of beam centre
`wavelength:
spot_offset: the mean offset between observed spots and predicted centroids. Only created if integration was performed using verbose_cv=True. Otherwise None.
minus_2B: the gradient of the ln(i) vs. sinsqtheta_over_lambda_sq plot
G: intercept of the of the ln(i) vs. sinsqtheta_over_lambda_sq plot
log_i: list of log_i intensities
sinsqtheta_over_lambda_sq: list of sinsqtheta_over_lambda_sq
wilson_err: standard error on the fit of ln(i) vs. sinsqtheta_over_lambda_sq
miller_fullies: a cctbx.miller array of fully recorded intensites.
- __module__ = 'xfel.clustering.singleframe'¶
- distance_from(other_uc)¶
Calculates distance using NCDist from Andrews and Bernstein J. Appl. Cryst. 2014 between this frame and some other unit cell. :param:other_uc: a 6-tuple of a, b, c, alpha, beta, gamma for some unit cell :return: the NCDist in A^2 to other_uc
- filter_negative_intensities()¶
Filters negative intensities from the Miller array. Acts in place. :return: acts in place.
- init_calc_wilson(use_b_factor, i_corrections=None)¶
If use_b_factor is :param i_corrections: allows flex array of correction factors (e.g. partialities) to be specified :param use_b_factor: if True, do a linear regression to fit G and B and returns the coeficients minus_2B, G, the transformed data log_i, and one_over_d_sqare. Also returns fit_stats, which is a dictionairy. If use_b_factor is False, then B is 0, and G is the mean intensity of the image. The r_value is then 0 (by definition), and the std_err is the standard error on the mean.
- Return minus_2B, G, log_i, on_over_d_square:
minus_2B: gradient of fit; G: intercept of fit; log_i: dependent variable of fit; one_over_d_square: independent variable of fit.
- static make_g6(uc)¶
Take a reduced Niggli Cell, and turn it into the G6 representation. This is similar but not identical to the metrical matrix. See doi:10.1107/S0567739473001063 Gruber (1973) doi:10.1107/S0108767388006427 Andrews and Bernstein (1988) doi:10.1107/S1600576713031002 Andrews and Bernstein (2014)
- n_reflections_by_sigi(sig_i_cuttoff)¶
Currently a placeholder that returns None.
This method should return the number of reflection in the frame that have an I/sig(I) > sig_i_cuttoff
- plot_wilson(width=30, ax=None)¶
Makes a log(I) vs 1/d**2 plot, displaying the raw partial data, a rolling average of the data, and the Wilson model fit to the data.
- Param:
width: smoothing window size
- Param:
ax: optional axes object to ve used for plotting
- polarization_correction()¶
Perform basic polarization correction in place, and change the is_polarization_corrected flag to True.
I_corrected = 2*I_uncorrected/(1 + cos(two_theta)**2)
- to_panda()¶
Returns the object attributes as a pandas series
- trim_res_limit(d_min=None, d_max=None)¶
Remove all miller indicies outside the range of _d_min, _d_max. Changes the object in place.
- Parameters:
d_min – min res of new miller array. Defaults to current value.
d_max – max res of new miller array. Defaults to current value.