libtbx.easy_mp: functions for painless parallelization of Python code
pool_map: parallel map() function for Unix/Linux systems
pool_map is essentially a simple replacement for the built-in map function, and is fast enough to be suitable for parallelization across arbitrarily large shared-memory systems even for relatively short-running methods. Although it is based on the map method of the multiprocessing.Pool class, the underlying implementation has been modified to take advantage of the internal behavior. On systems that support the os.fork() call, it can embed a reference to a “fixed” function that is preserved in the child processes, bypassing all pickle calls. This means that unpickleable objects can be embedded in a callable which is given as the fixed_func argument.
easy_mp is therefore suitable for parallelizing existing code that would otherwise require extensive refactoring (for instance, the weight optimization in phenix.refine).
Note that since pool_map depends on os.fork(), it does not support parallelization on Windows systems. However, it will simply run in serial on Windows, allowing it to be used in a platform-independent manner.
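The fork-based mechanism described above can be illustrated with the standard library alone. The sketch below is not the libtbx implementation (which subclasses multiprocessing.Pool); fork_map is a hypothetical helper that forks workers directly with os.fork(), so the callable is inherited through the copied process memory and only the results are pickled on their way back to the parent. It falls back to a serial loop where os.fork() is unavailable.

```python
import os
import pickle


def fork_map(fixed_func, args, processes=2):
    """Hypothetical helper (not part of libtbx): map fixed_func over args.

    fixed_func is never pickled; forked children inherit it in memory.
    """
    args = list(args)
    if not hasattr(os, "fork"):
        # No fork() available (e.g. Windows): run serially.
        return [fixed_func(a) for a in args]
    children = []
    for rank in range(processes):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute every `processes`-th item and send results back.
            status = 1
            try:
                os.close(read_fd)
                chunk = [fixed_func(a) for a in args[rank::processes]]
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(chunk, out)
                status = 0
            finally:
                os._exit(status)  # never fall through into the parent's code
        os.close(write_fd)
        children.append((rank, pid, read_fd))
    results = [None] * len(args)
    for rank, pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as inp:
            # Interleave this child's chunk back into the original order.
            results[rank::processes] = pickle.load(inp)
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    offset = 10
    add_offset = lambda x: x + offset  # a lambda cannot be pickled
    print(fork_map(add_offset, range(5)))  # [10, 11, 12, 13, 14]
```

Only the return values cross the process boundary here, so they must still be pickleable even though the callable is not.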
- libtbx.easy_mp.pool_map(processes=None, initializer=None, initargs=(), maxtasksperchild=<libtbx.AutoType object>, func=None, fixed_func=None, iterable=None, args=None, chunksize=<libtbx.AutoType object>, func_wrapper='simple', index_args=True, log=None, call_back_for_serial_run=None)
Parallelized map() using subclassed multiprocessing.Pool. If func is not None, this function essentially calls the Pool’s own map method; this means that both func and iterable/args must be pickle-able. If fixed_func is not None, it will not be pickled but instead saved as an attribute of the Pool, which will be preserved after the fork() call. Additional features include optional redirection of output and automatic process number determination.
Note that because of the reliance on fork(), this function will run in serial on Windows, regardless of how many processors are available.
- Parameters:
processes – number of processes to spawn; if None or Auto, the get_processes() function will be used.
func – target function (will be pickled)
fixed_func – “fixed” target function, which will be propagated to the child processes when forked (instead of pickling)
iterable – argument list
args – same as iterable (alternate keyword)
chunksize – number of arguments to process at once
Examples
>>> from libtbx import easy_mp
>>> def f(x):
...     # some_long_running_method is a placeholder for the real work
...     return some_long_running_method(x)
...
>>> args = range(1000)
>>> result = easy_mp.pool_map(
...     func=f,
...     args=args)
...
>>> print(len(result))
1000
>>> class f_caller(object):
...     def __init__(self, non_pickleable_object):
...         self._obj = non_pickleable_object
...     def __call__(self, x):
...         return some_long_running_method(x, self._obj)
...
>>> args = range(1000)
>>> f = f_caller(processed_pdb_file)
>>> result = easy_mp.pool_map(
...     fixed_func=f,
...     args=args)
...
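Whether a target can be passed as func or needs fixed_func depends on whether it can be pickled. A minimal sketch of such a check, using only the standard pickle module, might look as follows; is_picklable and pool_map_keyword are hypothetical helpers for illustration and are not part of libtbx.

```python
import pickle


def is_picklable(obj):
    """Hypothetical helper: True if obj survives pickle.dumps()."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def pool_map_keyword(target):
    """Suggest which pool_map keyword the target could be passed under."""
    return "func" if is_picklable(target) else "fixed_func"


if __name__ == "__main__":
    print(pool_map_keyword(len))          # built-in, pickled by reference
    print(pool_map_keyword(lambda x: x))  # lambdas cannot be pickled
```

A successful pickle.dumps() in the parent process is a necessary condition only; the object must also be importable under the same name in the worker.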
- libtbx.easy_mp.detect_problem()
Identify situations where multiprocessing will not work as required.
- libtbx.easy_mp.enable_multiprocessing_if_possible(nproc=<libtbx.AutoType object>, log=None)
Switch for using multiple CPUs with the pool_map function, usually called at the beginning of an app. If nproc is Auto or None and we are running Windows, it will be reset to 1.
- Parameters:
nproc – default number of processors to use
- Returns:
number of processors to use (None or Auto means automatic)
- libtbx.easy_mp.get_processes(processes)
Determine number of processes dynamically: number of CPUs minus the current load average (with a minimum of 1).
- Parameters:
processes – default number of processes (may be None or Auto)
- Returns:
actual number of processes to use
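The rule stated for get_processes (CPU count minus load average, never below 1) can be sketched with the standard library. This is an illustration under assumptions, not the libtbx code: estimate_processes is a hypothetical name, and the use of the 1-minute load average and of rounding to the nearest integer are guesses that the text above does not specify.

```python
import multiprocessing
import os


def estimate_processes(n_cpu=None, load=None):
    """Hypothetical sketch: CPUs minus current load, with a minimum of 1."""
    if n_cpu is None:
        n_cpu = multiprocessing.cpu_count()
    if load is None:
        try:
            load = os.getloadavg()[0]  # 1-minute load average (assumption)
        except (AttributeError, OSError):
            load = 0.0  # load average unavailable, e.g. on Windows
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, n_cpu - int(round(load)))


if __name__ == "__main__":
    print(estimate_processes(n_cpu=8, load=2.0))   # 6
    print(estimate_processes(n_cpu=4, load=10.0))  # 1
```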
parallel_map: parallel map() function for multiple architectures
parallel_map is a more general replacement for map(), and supports Windows systems in addition to Unix/Linux. It also enables use of managed clusters via several common queuing systems such as Sun Grid Engine or PBS. Because of the higher overhead involved, it is more suitable for functions that take minutes or longer to run, but the use of cluster resources means that it can handle much larger operations efficiently.
- libtbx.easy_mp.parallel_map(func, iterable, iterable_type=<function single_argument>, params=None, processes=1, method='multiprocessing', qsub_command=None, asynchronous=True, callback=None, preserve_order=True, preserve_exception_message=False, use_manager=False, stacktrace_handling='ignore', break_condition=None)
Generic parallel map() implementation for a variety of platforms, including the multiprocessing module and supported queuing systems, via the module libtbx.queuing_system_utils.scheduling. This is less flexible than pool_map above, since it does not provide a way to use a non-pickleable target function, but it provides a consistent API for programs where multiple execution methods are desired. It will also work on Windows (if the method is multiprocessing or threading).
Note that for most applications, the threading method will be constrained by the Global Interpreter Lock; multiprocessing is therefore preferred for parallelizing across a single multi-core system.
See Computational Crystallography Newsletter 3:37-42 (2012) for details of the underlying method.
- Parameters:
func – target function (must be pickleable)
iterable – list of arguments for func
processes – number of processes/threads to start
method – parallelization method (multiprocessing|threading|sge|lsf|pbs)
qsub_command – command to submit queue jobs (optional)
asynchronous – run queue jobs asynchronously
preserve_exception_message – keeps original exception message
preserve_order – keeps original order of results
break_condition – if break_condition(result) is True, break
- Returns:
a list of result objects
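The preserve_order and break_condition parameters can be pictured with a small standard-library stand-in. The sketch below is not how parallel_map is implemented: sketch_parallel_map is a hypothetical function, it uses threads from concurrent.futures purely to stay self-contained, and it assumes that when break_condition fires the results collected so far are returned, which the parameter description above does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sketch_parallel_map(func, iterable, processes=1,
                        preserve_order=True, break_condition=None):
    """Hypothetical stand-in illustrating two parallel_map options."""
    collected = []  # (input index, result) pairs in completion order
    with ThreadPoolExecutor(max_workers=processes) as pool:
        futures = {pool.submit(func, arg): i
                   for i, arg in enumerate(iterable)}
        for future in as_completed(futures):
            result = future.result()
            collected.append((futures[future], result))
            if break_condition is not None and break_condition(result):
                # Stop early; cancel() has no effect on jobs already running.
                for pending in futures:
                    pending.cancel()
                break
    if preserve_order:
        # Restore the order of the inputs rather than of completion.
        collected.sort(key=lambda pair: pair[0])
    return [result for _, result in collected]


if __name__ == "__main__":
    print(sketch_parallel_map(lambda x: x * x, range(5), processes=2))
    # [0, 1, 4, 9, 16]
```

Jobs finish in an unpredictable order when run asynchronously, which is why the results are tagged with their input index and sorted afterwards when ordering matters.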