bagua.torch_api.model_parallel.moe.layer

Module Contents

class bagua.torch_api.model_parallel.moe.layer.MoE(hidden_size, expert, num_local_experts=1, k=1, output_dropout_prob=0.0, capacity_factor=1.0, eval_capacity_factor=1.0, min_capacity=4, noisy_gate_policy=None)

Bases: torch.nn.Module

Initialize an MoE layer.

Parameters
  • hidden_size (int) – the hidden dimension of the model; note that this is also the input and output dimension of the layer.

  • expert (torch.nn.Module) – the torch module that defines the expert (e.g., an MLP or torch.nn.Linear).

  • num_local_experts (int, optional) – default=1, number of local experts per GPU.

  • k (int, optional) – default=1, top-k gating value, only supports k=1 or k=2.

  • output_dropout_prob (float, optional) – default=0.0, output dropout probability.

  • capacity_factor (float, optional) – default=1.0, the capacity factor of each expert at training time.

  • eval_capacity_factor (float, optional) – default=1.0, the capacity factor of each expert at evaluation time.

  • min_capacity (int, optional) – default=4, the minimum capacity per expert regardless of the capacity_factor.

  • noisy_gate_policy (str, optional) – default=None, noisy gate policy, valid options are ‘Jitter’, ‘RSample’ or ‘None’.
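
A minimal construction sketch (not taken from the source docs). It assumes a distributed environment has already been set up, since MoE layers exchange tokens across workers; the expert module, hidden size, and constructor arguments below are illustrative.

    import torch
    from bagua.torch_api.model_parallel.moe.layer import MoE

    hidden_size = 512

    # Any torch.nn.Module whose input and output dimension equal hidden_size
    # can serve as the expert; a two-layer MLP is used here for illustration.
    expert = torch.nn.Sequential(
        torch.nn.Linear(hidden_size, 4 * hidden_size),
        torch.nn.ReLU(),
        torch.nn.Linear(4 * hidden_size, hidden_size),
    )

    moe_layer = MoE(
        hidden_size=hidden_size,
        expert=expert,
        num_local_experts=2,         # two experts hosted on this GPU
        k=1,                         # top-1 gating
        noisy_gate_policy="RSample",
    )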

forward(self, hidden_states, used_token=None)

The MoE forward pass.

Parameters
  • hidden_states (Tensor) – input to the layer.

  • used_token (Tensor, optional) – default=None, a mask applied to select only the used tokens.

Returns

A tuple of (output, l_aux, exp_counts):

  • output (Tensor): the output of the layer

  • l_aux (Tensor): the gate (load-balancing) loss value

  • exp_counts (int): the expert count
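
A hedged sketch of the forward pass, continuing the construction example above. The tensor shape, the placeholder task loss, and the coefficient on the gate loss are illustrative assumptions, not prescribed by the source.

    # The last dimension must equal hidden_size; the leading dimensions
    # (e.g. sequence length and batch size) are illustrative.
    hidden_states = torch.randn(16, 8, hidden_size)

    output, l_aux, exp_counts = moe_layer(hidden_states)

    # l_aux is the gate (load-balancing) loss and is typically added to the
    # task loss with a small coefficient; both the placeholder task loss and
    # the 0.01 coefficient below are assumptions for illustration only.
    task_loss = output.sum()
    total_loss = task_loss + 0.01 * l_aux
    total_loss.backward()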