bagua.torch_api.bucket

Module Contents

class bagua.torch_api.bucket.BaguaBucket(tensors, name, flatten, alignment=1)

Create a Bagua bucket with a list of Bagua tensors.

Parameters
  • tensors – A list of Bagua tensors to be put in the bucket.

  • name – The unique name of the bucket.

  • flatten – If True, flatten the input tensors so that they are contiguous in memory.

  • alignment – If alignment > 1, Bagua will create a padding tensor to the bucket so that the total number of elements in the bucket divides the given alignment.

name

The bucket’s name.

tensors

The tensors contained within the bucket.

append_centralized_synchronous_op(self, hierarchical=False, average=True, scattergather=False, compression=None)

Append a centralized synchronous operation to a bucket. It will sum or average the tensors in the bucket for all workers.

The operations will be executed by the Bagua backend in the order they are appended when all the tensors within the bucket are marked ready.

Parameters
  • hierarchical (bool) – Enable hierarchical communication. Which means the GPUs on the same machine will communicate will each other first. After that, machines do inter-node communication. This can boost performance when the inter-node communication cost is high.

  • average (bool) – If True, the gradients on each worker are averaged. Otherwise, they are summed.

  • scattergather (bool) – If true, the communication between workers are done with scatter gather instead of allreduce. This is required for using compression.

  • compression (Optional[str]) – If not None, the tensors will be compressed for communication. Currently “MinMaxUInt8” is supported.

Returns

The bucket itself.

Return type

BaguaBucket

append_decentralized_synchronous_op(self, hierarchical=True, peer_selection_mode='all', communication_interval=1)

Append a decentralized synchronous operation to a bucket. It will do gossipy style model averaging among workers.

The operations will be executed by the Bagua backend in the order they are appended when all the tensors within the bucket are marked ready.

Parameters
  • hierarchical (bool) – Enable hierarchical communication. Which means the GPUs on the same machine will communicate will each other first. After that, machines do inter-node communication. This can boost performance when the inter-node communication cost is high.

  • peer_selection_mode (str) – Can be “all” or “shift_one”. “all” means all workers’ weights are averaged in each communication step. “shift_one” means each worker selects a different peer to do weights average in each communication step.

  • communication_interval (int) – Number of iterations between two communication steps.

Returns

The bucket itself.

Return type

BaguaBucket

append_python_op(self, python_function)

Append a Python operation to a bucket. A Python operation is a Python function that takes the bucket’s name and returns None. It can do arbitrary things within the function body.

The operations will be executed by the Bagua backend in the order they are appended when all the tensors within the bucket are marked ready.

Parameters

python_function (Callable[[str], None]) – The Python operation function.

Returns

The bucket itself.

Return type

BaguaBucket

check_flatten(self)
Returns

True if the bucket’s tensors are contiguous in memory.

Return type

bool

clear_ops(self)

Clear the previously appended operations.

Return type

BaguaBucket