xfel.clustering package

xfel.clustering.cluster module

This module is designed to provide tools to deal with groups of serial crystallography images.

The class Cluster allows the creation, storage and manipulation of these sets of frames. Methods exist to create sub-clusters (new cluster objects) or to act on an existing cluster, e.g. to plot the unit cell distributions.

Author: Oliver Zeldin <zeldin@stanford.edu>

cluster.Cluster

class xfel.clustering.cluster.Cluster(data, cname='cluster', info='')

Bases: object

Class for operation on groups of single XFEL images (here described by SingleFrame objects) as cluster objects.

Objects can be created directily using the __init__ method, or by using a classmethod e.g. to create an object from a folder full of integration pickles. Whenever a method can plot, there is the option of passing it an appropriate number of matplotlib axes objects, which will then get returned for use in composite plots. See cluster.42 for an example. If no axes are passed, the methods will just plot the result to the screen. Clustering filters can act on these to break them up into cluster objects with different members. A ‘filter’ is just a clustering procedure that puts the passes and fails into different clusters. This is acheived through the make_sub_cluster() method. This also keeps track of a sub-clusters heritage through the .info string, which is appended to. The idea is to be able to write filter scripts for each data. e.g

>>> test_cluster = Cluster.from_directories(["~/test_data"],
...                                        'test_script')
>>> P3_only = test_cluster.point_group_filer('P3')
>>> sub_clusters = P3_only.ab_cluster(1200)
>>> big_cluster = max(sub_clusters, key=lambda x: len(x.members))
>>> best_data = big_cluster.total_intensity_filter(res=6.5,
...                                               completeness_threshold=0.1,
...                                               plot=False)
>>> print best_data.info

Subsequent postrefinenment/merging programs can be called on an output cluster.lst file:

>>> prime.postrefine params.phil $(cat cluster.lst)
__dict__ = mappingproxy({'__module__': 'xfel.clustering.cluster', '__doc__': 'Class for operation on groups of single XFEL images (here described by\n  SingleFrame objects) as cluster objects.\n\n  Objects can be created directily using the __init__ method, or by using a\n  classmethod e.g. to create an object from a folder full of integration\n  pickles. Whenever a method can plot, there is the option of\n  passing it an appropriate number of matplotlib axes objects, which will then\n  get returned for use in composite plots. See cluster.42 for an example.\n  If no axes are passed, the methods will just plot the result to the screen.\n  Clustering filters can act on these to break them up into cluster objects with\n  different members. A \'filter\' is just a clustering procedure that puts the\n  passes and fails into different clusters. This is acheived through the\n  make_sub_cluster() method. This also keeps track of a sub-clusters heritage\n  through the .info string, which is appended to. The idea is to be able to\n  write filter scripts for each data.\n  e.g ::\n\n      >>> test_cluster = Cluster.from_directories(["~/test_data"],\n      ...                                        \'test_script\')\n      >>> P3_only = test_cluster.point_group_filer(\'P3\')\n      >>> sub_clusters = P3_only.ab_cluster(1200)\n      >>> big_cluster = max(sub_clusters, key=lambda x: len(x.members))\n      >>> best_data = big_cluster.total_intensity_filter(res=6.5,\n      ...                                               completeness_threshold=0.1,\n      ...                                               plot=False)\n      >>> print best_data.info\n\n  Subsequent postrefinenment/merging programs can be called on an output\n  cluster.lst file::\n      >>> prime.postrefine params.phil $(cat cluster.lst)\n  ', '__init__': <function Cluster.__init__>, 'from_directories': <classmethod(<function Cluster.from_directories>)>, 'from_crystal_symmetries': <classmethod(<function Cluster.from_crystal_symmetries>)>, 'from_list': <classmethod(<function Cluster.from_list>)>, 'from_iterable': <classmethod(<function Cluster.from_iterable>)>, 'from_files': <classmethod(<function Cluster.from_files>)>, 'from_expts': <classmethod(<function Cluster.from_expts>)>, 'make_sub_cluster': <function Cluster.make_sub_cluster>, 'best_by_CC': <function Cluster.best_by_CC>, 'print_ucs': <function Cluster.print_ucs>, 'point_group_filter': <function Cluster.point_group_filter>, 'total_intensity_filter': <function Cluster.total_intensity_filter>, 'ab_cluster': <function Cluster.ab_cluster>, 'dump_file_list': <function Cluster.dump_file_list>, 'visualise_orientational_distribution': <function Cluster.visualise_orientational_distribution>, 'intensity_statistics': <function Cluster.intensity_statistics>, 'all_frames_intensity_stats': <function Cluster.all_frames_intensity_stats>, 'merge_dict': <function Cluster.merge_dict>, '__len__': <function Cluster.__len__>, 'dump_as_mtz': <function Cluster.dump_as_mtz>, 'to_pandas': <function Cluster.to_pandas>, 'uc_feature_vector': <function Cluster.uc_feature_vector>, 'prime_postrefine': <function Cluster.prime_postrefine>, '__dict__': <attribute '__dict__' of 'Cluster' objects>, '__weakref__': <attribute '__weakref__' of 'Cluster' objects>, '__annotations__': {}})
__init__(data, cname='cluster', info='')

Builds a cluster from a list of SingleFrame objects, as well as information about these as a cluster (e.g. mean unit cell).

Parameters:
  • data – a list of SingleFrame objects

  • cname – the name of the cluster, as a string.

  • info – an info-string for the cluster.

__len__()

Number of images in the cluster

__module__ = 'xfel.clustering.cluster'
__weakref__

list of weak references to the object

ab_cluster(threshold=10000, method='distance', linkage_method='single', log=False, ax=None, write_file_lists=True, schnell=False, doplot=True, labels='default')

Hierarchical clustering using the unit cell dimentions.

Parameters:
  • threshold – the threshold to use for prunning the tree into clusters.

  • method – which clustering method from scipy to use when creating the tree (see scipy.cluster.hierarchy)

  • linkage_method – which linkage method from scipy to use when creating the linkages. x (see scipy.cluster.hierarchy)

  • log – if True, use log scale on y axis.

  • ax – if a matplotlib axes object is provided, plot to this. Otherwise, create a new axes object and display on screen.

  • write_file_lists – if True, write out the files that make up each cluster.

  • schnell – if True, use simple euclidian distance, otherwise, use Andrews-Berstein distance from Andrews & Bernstein J Appl Cryst 47:346 (2014) on the Niggli cells.

  • doplot – Boolean flag for if the plotting should be done at all.

Runs faster if switched off. :param labels: ‘default’ will not display any labels for more than 100 images, but will display file names for fewer. This can be manually overidden with a boolean flag. :return: A list of Clusters ordered by largest Cluster to smallest

Note

Use ‘schnell’ option with caution, since it can cause strange behaviour around symmetry boundaries.

all_frames_intensity_stats(ax=None, smoothing_width=2000)

Goes through all frames in the cluster, and plots all the partial intensites. Then does a linear fit and rolling average on these.

Parameters:
  • smoothing_width – the width of the smoothing window.

  • ax – Optional matplotlib axes object to plot to. Otherwise, plot to screen.

Returns:

the axis, with the data plotted onto it.

best_by_CC(other, assert_is_similar_symmetry=False)

Return the SingleFrame object with the highest CC to a reference miller array. :param other: miller array object to be correlated against :return: a SingleFrame object.

dump_as_mtz(mtz_name, use_fullies=False)

Merge using weighted mean and standard deviation if all miller arrays

dump_file_list(out_file_name=None)

Dumps a list of paths to inegration pickle files to a file. One line per image. Provides easy input into post-refinement programs.

Parameters:

out_file_name – the output file name.

classmethod from_crystal_symmetries(crystal_symmetries, lattice_ids=None, _prefix='cluster_from_crystal_symmetries', _message='Made from list of individual cells', n_images=None, dials=False, **kwargs)

Constructor to get a cluster from a list of crystal symmetries.

classmethod from_directories(path_to_integration_dir, _prefix='cluster_from_dir', n_images=None, dials=False, **kwargs)

Constructor to get a cluster from pickle files, from the recursively walked paths. Can take more than one argument for multiple folders. usage: Cluster.from_directories(..) :param path_to_integration_dir: list of directories containing pickle files. Will be searched recursively. :param n_images: find at most this number of images. :param use_b: Boolean. If True, intialise Scale and B. If False, use only mean intensity scalling.

classmethod from_expts(refl_table=None, expts_list=None, _prefix='cluster_from_file', _message='Made from experiment objects', n_images=None, **kwargs)

Constructor to get a cluster from experiment and reflection list objects :param refl_table: DIALS integrated reflection table :param expts_list: DIALS experiment list :param n_images: find at most this number of images

classmethod from_files(raw_input=None, pickle_list=[], dials_refls=[], dials_expts=[], _prefix='cluster_from_file', _message='Made from list of individual files', n_images=None, dials=False, json=False, **kwargs)

Constructor to get a cluster from a list of individual files. :param pickle_list: list of pickle files :param dials_refls: list of DIALS integrated reflections :param dials_expts: list of DIALS experiment jsons :param n_images: find at most this number of images :param dials: use the dials_refls and dials_expts arguments to construct the clusters (default: False) :param use_b: Boolean. If True, intialise Scale and B. If False, use only mean intensity scalling.

classmethod from_iterable(iterable, _prefix='cluster_from_iterable', _message='Made from list of individual cells', **kwargs)

Constructor to get a cluster from an iterable (a list or tuple). The file must list unit cell a,b,c,alpha,beta,gamma and space_group_type, each as a single token. :param iterable: a list or a tuple

classmethod from_list(file_name, raw_input=None, pickle_list=[], dials_refls=[], dials_expts=[], _prefix='cluster_from_file', _message='Made from list of individual cells', n_images=None, dials=False, **kwargs)

Constructor to get a cluster from a single file. The file must list unit cell a,b,c,alpha,beta,gamma and space_group_type, each as a single token. :param file_name: pathname of the file

intensity_statistics(ax=None)

Uses the per-frame B and G fits (gradient and intercept of the ln(i) vs (sin(theta)/lambda)**2 plot) to create three agregate plots: 1) histogram of standard errors on the per-frame fits 2) histogram of B factors 3) scatter plot of intercept vs. gradient (G vs. B)

Parameters:

ax – optionally hand the method three matplotlib axes objects to plot onto. If not specified, will plot the data.

Returns:

the three axes, with the data plotted onto them.

make_sub_cluster(new_members, new_prefix, new_info)

Make a sub-cluster from a list of SingleFrame objects from the old SingleFrame array.

Parameters:

new_members – a new set of SingleFrame objects, typically a subset of

the current cluster. :param new_prefix: prefix to be passed directly to __init__ :param new_info: new information about this cluster. This is inteligently appended to the old info string so that the history of sub-clusters is tracted.

merge_dict(use_fullies=False)

Make a dict of Miller indices with ([list of intensities], resolution) value tuples for each miller index.

point_group_filter(point_group)

Return all the SingleFrames that have a given pointgroup.

prime_postrefine(inputfile)

Run postrefinement from Prime on a cluster. Implements the prime API, and updates the SingleFrame objects with the attribute miller_fullies. :param cluster: a cluster object. :param inputfile: a Prime .inp file.

print_ucs()

Prints a list of all the unit cells in the cluster to CSV.

to_pandas()
total_intensity_filter(res='', completeness_threshold=0.95, plot=False)

Note

This is still in development, use at own risk!

Creates an optimal sub-cluster using the fewest images that fulfil the criteria defined by: :param res: desired resolution. Defaults to that of the dataset. :param completeness: the desired completeness of the subset :param multiplicity: the desired multiplicity of the subset

uc_feature_vector()

Return a len(cluster) * 6 numpy array of features for use in ML algos

visualise_orientational_distribution(axes_to_return=None, cbar=True)

Creates a plot of the orientational distribution of the unit cells.

Parameters:
  • axes_to_return – if None, print to screen, otherwise, requires 3 axes objects, and will return them.

  • cbar – boolean to specify if a color bar should be used.

xfel.clustering.cluster_groups module

Utilitites for dealing with lists of clusters.

xfel.clustering.cluster_groups.unit_cell_info(sub_clusters)

Print unit cell information for a list of clusters.

Parameters:

sub_clusters – a list of cluster objects

Returns:

a string containing median unit cells, standard deviations and point group composition of each cluster.

xfel.clustering.singleframe module

Module for working with single images in a serial crystallography dataset

singleframe.SingleFrame

class xfel.clustering.singleframe.SingleFrame(path=None, filename=None, crystal_num=0, remove_negative=False, use_b=True, scale=True, dicti=None, pixel_size=None)

Bases: InputFrame

Class that creates single-image agregate metrics/scoring that can then be used in downstream clustering or filtering procedures.

ANGSTROMS_TO_EV = 12398.425
__init__(path=None, filename=None, crystal_num=0, remove_negative=False, use_b=True, scale=True, dicti=None, pixel_size=None)

Constructor for SingleFrame object, using a cctbx.xfel integration pickle.

Parameters:
  • path – path to integration pickle

  • filename – the file name alone (used as a label)

  • crystal_num – if multiple lattices present, the latice number.

  • remove_negative – Boolean for removal of negative intensities

  • use_b – if True, initialise scale and B, if false, use only mean-intensity scaling.

  • dicti – optional. If a dictionairy is supplied here, will create object from that rather than attempting to read the file specified in path, filename.

  • pixel_size – the size of pixels in mm. Defaults to a MAR detector with a warning at debug level of logging.

  • scale – if False, will intialise scales to G=1, B=0.

Returns:

a SingleFrame object, with the following Object attributes:

Object attributes are:
  • is_polarization_corrected: Boolean flag indicatinf if polarization correction has been applied

  • miller_array: the cctbx.miller miller array of spot intensities.

  • mapped_predictions: the mapped_predictions locations

  • path: full path to the original file

  • name: file-name, used as an identifier

  • `crystal_system:

  • pg: point group of pickle

  • uc: Niggli unit cell as a tuple

  • orientation: cctbx crystal_orientation object

  • total_i: the total integrated intensity for this frame

  • xbeam: x-location of beam centre

  • ybeam: y-location of beam centre

  • `wavelength:

  • spot_offset: the mean offset between observed spots and predicted centroids. Only created if integration was performed using verbose_cv=True. Otherwise None.

  • minus_2B: the gradient of the ln(i) vs. sinsqtheta_over_lambda_sq plot

  • G: intercept of the of the ln(i) vs. sinsqtheta_over_lambda_sq plot

  • log_i: list of log_i intensities

  • sinsqtheta_over_lambda_sq: list of sinsqtheta_over_lambda_sq

  • wilson_err: standard error on the fit of ln(i) vs. sinsqtheta_over_lambda_sq

  • miller_fullies: a cctbx.miller array of fully recorded intensites.

__module__ = 'xfel.clustering.singleframe'
distance_from(other_uc)

Calculates distance using NCDist from Andrews and Bernstein J. Appl. Cryst. 2014 between this frame and some other unit cell. :param:other_uc: a 6-tuple of a, b, c, alpha, beta, gamma for some unit cell :return: the NCDist in A^2 to other_uc

filter_negative_intensities()

Filters negative intensities from the Miller array. Acts in place. :return: acts in place.

init_calc_wilson(use_b_factor, i_corrections=None)

If use_b_factor is :param i_corrections: allows flex array of correction factors (e.g. partialities) to be specified :param use_b_factor: if True, do a linear regression to fit G and B and returns the coeficients minus_2B, G, the transformed data log_i, and one_over_d_sqare. Also returns fit_stats, which is a dictionairy. If use_b_factor is False, then B is 0, and G is the mean intensity of the image. The r_value is then 0 (by definition), and the std_err is the standard error on the mean.

Return minus_2B, G, log_i, on_over_d_square:

minus_2B: gradient of fit; G: intercept of fit; log_i: dependent variable of fit; one_over_d_square: independent variable of fit.

static make_g6(uc)

Take a reduced Niggli Cell, and turn it into the G6 representation. This is similar but not identical to the metrical matrix. See doi:10.1107/S0567739473001063 Gruber (1973) doi:10.1107/S0108767388006427 Andrews and Bernstein (1988) doi:10.1107/S1600576713031002 Andrews and Bernstein (2014)

n_reflections_by_sigi(sig_i_cuttoff)

Currently a placeholder that returns None.

This method should return the number of reflection in the frame that have an I/sig(I) > sig_i_cuttoff

plot_wilson(width=30, ax=None)

Makes a log(I) vs 1/d**2 plot, displaying the raw partial data, a rolling average of the data, and the Wilson model fit to the data.

Param:

width: smoothing window size

Param:

ax: optional axes object to ve used for plotting

polarization_correction()

Perform basic polarization correction in place, and change the is_polarization_corrected flag to True.

I_corrected = 2*I_uncorrected/(1 + cos(two_theta)**2)

to_panda()

Returns the object attributes as a pandas series

trim_res_limit(d_min=None, d_max=None)

Remove all miller indicies outside the range of _d_min, _d_max. Changes the object in place.

Parameters:
  • d_min – min res of new miller array. Defaults to current value.

  • d_max – max res of new miller array. Defaults to current value.