Quantization configuration#
Quark Quantization Config API for PyTorch
- class quark.torch.quantization.config.config.Config(global_quant_config: QuantizationConfig, layer_type_quant_config: dict[type[Module], QuantizationConfig] = {}, layer_quant_config: dict[str, QuantizationConfig] = {}, kv_cache_quant_config: dict[str, QuantizationConfig] = {}, kv_cache_group: list[str] = [], min_kv_scale: float = 0.0, softmax_quant_spec: QuantizationSpec | None = None, exclude: list[str] = [], algo_config: list[AlgoConfig] | None = None, quant_mode: QuantizationMode = QuantizationMode.eager_mode, log_severity_level: int | None = 1, version: str | None = '0.10')[source]#
A class that encapsulates comprehensive quantization configurations for a machine learning model, allowing for detailed and hierarchical control over quantization parameters across different model components.
- Parameters:
global_quant_config (QuantizationConfig) – Global quantization configuration applied to the entire model unless overridden at the layer level.
layer_type_quant_config (Dict[Type[torch.nn.Module], QuantizationConfig]) – A dictionary mapping from layer types (e.g., nn.Conv2d, nn.Linear) to their quantization configurations. Default is {}.
layer_quant_config (Dict[str, QuantizationConfig]) – A dictionary mapping from layer names to their quantization configurations, allowing for per-layer customization. Default is {}.
kv_cache_quant_config (Dict[str, QuantizationConfig]) – A dictionary mapping from layer names to their kv_cache quantization configurations. Default is {}.
softmax_quant_spec (Optional[QuantizationSpec]) – The quantization specification for the output of nn.functional.softmax. Default is None.
exclude (List[str]) – A list of layer names to be excluded from quantization, enabling selective quantization of the model. Default is [].
algo_config (Optional[List[AlgoConfig]]) – Optional configurations for quantization algorithms, such as GPTQ, AWQ, and Qronos. After an algorithm is applied, the datatype/fake_datatype of the weights is changed together with the quantization scales. Default is None.
quant_mode (QuantizationMode) – The quantization mode to be used (eager_mode or fx_graph_mode). Default is QuantizationMode.eager_mode.
log_severity_level (Optional[int]) – 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL/FATAL. Default is 1.
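For instance, a minimal sketch that quantizes every weight to int8 per tensor while skipping the language-model head (the layer name "lm_head" is illustrative, not part of the API):
from quark.torch.quantization.config.config import Config, QuantizationConfig, Int8PerTensorSpec

# Weight-only int8 per-tensor quantization, applied model-wide.
weight_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()

global_config = QuantizationConfig(weight=weight_spec)
# "lm_head" is a hypothetical layer name used for illustration.
config = Config(global_quant_config=global_config, exclude=["lm_head"])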
- class quark.torch.quantization.config.config.QuantizationConfig(input_tensors: QuantizationSpec | list[QuantizationSpec] | None = None, output_tensors: QuantizationSpec | list[QuantizationSpec] | None = None, weight: QuantizationSpec | list[QuantizationSpec] | None = None, bias: QuantizationSpec | list[QuantizationSpec] | None = None, target_device: DeviceType | None = None)[source]#
A data class that specifies quantization configurations for different components of a module, allowing hierarchical control over how each tensor type is quantized.
- Parameters:
input_tensors (Optional[Union[QuantizationSpec, List[QuantizationSpec]]]) – Input tensors quantization specification. If None, the hierarchical quantization setup is followed: e.g., if input_tensors in layer_type_quant_config is None, the configuration from global_quant_config is used instead. If it is also None in global_quant_config, input tensors are not quantized. Defaults to None.
output_tensors (Optional[Union[QuantizationSpec, List[QuantizationSpec]]]) – Output tensors quantization specification, with the same hierarchical fallback as above. Defaults to None.
weight (Optional[Union[QuantizationSpec, List[QuantizationSpec]]]) – Weight tensors quantization specification, with the same hierarchical fallback as above. Defaults to None.
bias (Optional[Union[QuantizationSpec, List[QuantizationSpec]]]) – Bias tensors quantization specification, with the same hierarchical fallback as above. Defaults to None.
target_device (Optional[DeviceType]) – Configuration specifying the target device (e.g., CPU, GPU, IPU) for the quantized model.
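A sketch of the hierarchical fallback described above (the int8 spec and the nn.Linear mapping are illustrative): the global configuration quantizes only weights, while nn.Linear layers additionally quantize their input tensors.
import torch.nn as nn
from quark.torch.quantization.config.config import Config, QuantizationConfig, Int8PerTensorSpec

int8_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()

# input_tensors is set for nn.Linear, so the global config is not consulted
# for Linear inputs; all other layers fall back to the weight-only global config.
global_config = QuantizationConfig(weight=int8_spec)
linear_config = QuantizationConfig(input_tensors=int8_spec, weight=int8_spec)
config = Config(
    global_quant_config=global_config,
    layer_type_quant_config={nn.Linear: linear_config}
)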
- class quark.torch.quantization.config.config.TwoStageSpec(first_stage: DataTypeSpec | QuantizationSpec, second_stage: DataTypeSpec | QuantizationSpec)[source]#
A data class that specifies a two-stage quantization specification for a tensor, in which a second-stage specification is applied on top of the first stage. See ProgressiveSpec and ScaleQuantSpec below for the concrete variants with examples.
- class quark.torch.quantization.config.config.ProgressiveSpec(first_stage: DataTypeSpec | QuantizationSpec, second_stage: DataTypeSpec | QuantizationSpec)[source]#
A data class that specifies a progressive quantization specification for a tensor. The first stage quantizes the input tensor, while the second stage quantizes the output from the first stage.
For example, to progressively quantize a float16 tensor:
First quantize it to fp8_e4m3 using fp8_e4m3 per-tensor quantization, yielding an fp8_e4m3 tensor.
Then quantize the fp8_e4m3 tensor to int4 using int4 per-channel quantization, yielding an int4 tensor.
The configuration for this example would be:
quant_spec = ProgressiveSpec(
    first_stage=FP8E4M3PerTensorSpec(observer_method="min_max", is_dynamic=False),
    second_stage=Int4PerChannelSpec(
        symmetric=False,
        scale_type="float",
        round_method="half_even",
        ch_axis=0,
        is_dynamic=False
    )
).to_quantization_spec()
- class quark.torch.quantization.config.config.ScaleQuantSpec(first_stage: DataTypeSpec | QuantizationSpec, second_stage: DataTypeSpec | QuantizationSpec)[source]#
A data class that specifies a two-stage quantization process for scale quantization.
The quantization happens in two stages:
First stage quantizes the input tensor itself.
Second stage quantizes the scale values from the first stage quantization.
For example, given a float16 tensor:
First quantize the tensor to fp4_e2m1 using fp4_e2m1 per-group quantization, producing an fp4_e2m1 tensor with float16 scale values.
Then quantize those float16 scale values to fp8_e4m3 using fp8_e4m3 per-tensor quantization.
The configuration for this example would be:
quant_spec = ScaleQuantSpec(
    first_stage=FP4PerGroupSpec(group_size=16, is_dynamic=False),
    second_stage=FP8E4M3PerTensorSpec(observer_method="min_max", is_dynamic=False)
).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint4PerTensorSpec(observer_method: str | None = None, symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using uint4 per tensor quantization.
Example:
quantization_spec = Uint4PerTensorSpec(is_dynamic=True, symmetric=False).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint4PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None, zero_point_type: str | None = 'int32')[source]#
Helper class to define a QuantizationSpec using uint4 per channel quantization.
Example:
quantization_spec = Uint4PerChannelSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    ch_axis=0,
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint4PerGroupSpec(symmetric: bool = False, ch_axis: int | None = None, is_dynamic: bool | None = None, scale_type: str | None = None, round_method: str | None = 'half_even', group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using uint4 per group quantization.
Example:
quantization_spec = Uint4PerGroupSpec(
    symmetric=False,
    scale_type="float",
    round_method="half_even",
    ch_axis=1,
    is_dynamic=False,
    group_size=128
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int3PerGroupSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using int3 per group quantization.
Example:
quantization_spec = Int3PerGroupSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False,
    group_size=32
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int3PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using int3 per channel quantization.
Example:
quantization_spec = Int3PerChannelSpec(
    symmetric=False,
    scale_type="float",
    round_method="half_even",
    ch_axis=0,
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int2PerGroupSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using int2 per group quantization.
Example:
quantization_spec = Int2PerGroupSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False,
    group_size=32
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int4PerTensorSpec(observer_method: str | None = None, symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using int4 per tensor quantization.
Example:
quantization_spec = Int4PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int4PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using int4 per channel quantization.
Example:
quantization_spec = Int4PerChannelSpec(
    symmetric=False,
    scale_type="float",
    round_method="half_even",
    ch_axis=0,
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int4PerGroupSpec(symmetric: bool = True, ch_axis: int | None = None, is_dynamic: bool | None = None, scale_type: str | None = None, round_method: str | None = 'half_even', group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using int4 per group quantization.
Example:
quantization_spec = Int4PerGroupSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    ch_axis=1,
    is_dynamic=False,
    group_size=128
).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint8PerTensorSpec(observer_method: str | None = None, symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using uint8 per tensor quantization.
Example:
quantization_spec = Uint8PerTensorSpec(
    observer_method="percentile",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint8PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using uint8 per channel quantization.
Example:
quantization_spec = Uint8PerChannelSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    ch_axis=0,
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Uint8PerGroupSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using uint8 per group quantization.
Example:
quantization_spec = Uint8PerGroupSpec(
    symmetric=False,
    scale_type="float",
    round_method="half_even",
    ch_axis=1,
    is_dynamic=False,
    group_size=128
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int8PerTensorSpec(observer_method: str | None = None, symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using int8 per tensor quantization.
Example:
quantization_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int8PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using int8 per channel quantization.
Example:
quantization_spec = Int8PerChannelSpec(
    symmetric=False,
    scale_type="float",
    round_method="half_even",
    ch_axis=0,
    is_dynamic=False
).to_quantization_spec()
- class quark.torch.quantization.config.config.Int8PerGroupSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using int8 per group quantization.
Example:
quantization_spec = Int8PerGroupSpec(
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    ch_axis=1,
    is_dynamic=False,
    group_size=128
).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E4M3PerTensorSpec(observer_method: str | None = None, scale_type: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E4M3 per tensor quantization.
Example:
quantization_spec = FP8E4M3PerTensorSpec(observer_method="min_max", is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E4M3PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E4M3 per channel quantization.
Example:
quantization_spec = FP8E4M3PerChannelSpec(is_dynamic=False, ch_axis=0).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E4M3PerGroupSpec(scale_format: str | None = 'float32', scale_calculation_mode: str | None = None, ch_axis: int | None = -1, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E4M3 per group quantization.
Example:
quantization_spec = FP8E4M3PerGroupSpec(ch_axis=-1, group_size=128, is_dynamic=True).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E5M2PerTensorSpec(observer_method: str | None = None, symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E5M2 per tensor quantization.
Example:
quantization_spec = FP8E5M2PerTensorSpec(observer_method="min_max", is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E5M2PerChannelSpec(symmetric: bool | None = None, scale_type: str | None = None, round_method: str | None = None, ch_axis: int | None = None, is_dynamic: bool | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E5M2 per channel quantization.
Example:
quantization_spec = FP8E5M2PerChannelSpec(is_dynamic=False, ch_axis=0).to_quantization_spec()
- class quark.torch.quantization.config.config.FP8E5M2PerGroupSpec(scale_format: str | None = 'float32', scale_calculation_mode: str | None = None, ch_axis: int | None = -1, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using FP8E5M2 per group quantization.
Example:
quantization_spec = FP8E5M2PerGroupSpec(ch_axis=-1, group_size=128, is_dynamic=True).to_quantization_spec()
- class quark.torch.quantization.config.config.FP4PerGroupSpec(scale_format: str | None = 'float32', scale_calculation_mode: str | None = None, ch_axis: int | None = -1, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using FP4 per group quantization.
Example:
quantization_spec = FP4PerGroupSpec(ch_axis=-1, group_size=16, is_dynamic=True).to_quantization_spec()
- class quark.torch.quantization.config.config.FP6E2M3PerGroupSpec(scale_format: str | None = 'float32', scale_calculation_mode: str | None = None, ch_axis: int | None = -1, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using FP6E2M3 per group quantization.
- class quark.torch.quantization.config.config.FP6E3M2PerGroupSpec(scale_format: str | None = 'float32', scale_calculation_mode: str | None = None, ch_axis: int | None = -1, is_dynamic: bool | None = None, group_size: int | None = None)[source]#
Helper class to define a QuantizationSpec using FP6E3M2 per group quantization.
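These two FP6 helpers ship without usage examples; a sketch, assuming they follow the same per-group interface as FP4PerGroupSpec (the group size of 32 is illustrative):
quantization_spec = FP6E2M3PerGroupSpec(
    ch_axis=-1,
    group_size=32,
    is_dynamic=True
).to_quantization_spec()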
- class quark.torch.quantization.config.config.Float16Spec[source]#
Helper class to define a QuantizationSpec using the float16 data type. The resulting QuantizationSpec does not quantize the tensor.
Example:
quantization_spec = Float16Spec().to_quantization_spec()
- class quark.torch.quantization.config.config.Bfloat16Spec[source]#
Helper class to define a QuantizationSpec using the bfloat16 data type. The resulting QuantizationSpec does not quantize the tensor.
Example:
quantization_spec = Bfloat16Spec().to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP8E4M3Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with FP8E4M3 elements.
Example:
quantization_spec = OCP_MXFP8E4M3Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP8E5M2Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with FP8E5M2 elements.
Example:
quantization_spec = OCP_MXFP8E5M2Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP6E3M2Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with FP6E3M2 elements.
Example:
quantization_spec = OCP_MXFP6E3M2Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP6E2M3Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with FP6E2M3 elements.
Example:
quantization_spec = OCP_MXFP6E2M3Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP4Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with FP4 elements.
Example:
quantization_spec = OCP_MXFP4Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXINT8Spec(is_dynamic: bool = True, ch_axis: int = -1, scale_calculation_mode: str = 'even')[source]#
Helper class to define a QuantizationSpec using the OCP MX data type with INT8 elements.
Example:
quantization_spec = OCP_MXINT8Spec(ch_axis=-1, is_dynamic=False).to_quantization_spec()
- class quark.torch.quantization.config.config.OCP_MXFP4DiffsSpec(ch_axis: int | None = None, is_dynamic: bool | None = None, scale_calculation_mode: str | None = 'even')[source]#
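This class ships without a docstring; a usage sketch, assuming it exposes the same interface as the other OCP MX helpers above:
quantization_spec = OCP_MXFP4DiffsSpec(ch_axis=-1, is_dynamic=False).to_quantization_spec()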
- class quark.torch.quantization.config.config.MX6Spec(ch_axis: int = -1, block_size: int = 32, scale_calculation_mode: str | None = 'even')[source]#
Helper class to define a QuantizationSpec using the MX6 data type as defined in https://arxiv.org/pdf/2302.08007. More details are available in the Two Level Quantization Formats documentation.
Example:
quantization_spec = MX6Spec(ch_axis=-1, block_size=32).to_quantization_spec()
- class quark.torch.quantization.config.config.MX9Spec(ch_axis: int = -1, block_size: int = 32, scale_calculation_mode: str | None = 'even')[source]#
Helper class to define a QuantizationSpec using the MX9 data type as defined in https://arxiv.org/pdf/2302.08007. More details are available in the Two Level Quantization Formats documentation.
Example:
quantization_spec = MX9Spec(ch_axis=-1, block_size=32).to_quantization_spec()
- class quark.torch.quantization.config.config.BFP16Spec(ch_axis: int = -1, scale_calculation_mode: str | None = 'even')[source]#
Helper class to define a QuantizationSpec using the bfp16 data type.
Example:
quantization_spec = BFP16Spec(ch_axis=-1).to_quantization_spec()
- class quark.torch.quantization.config.config.QuantizationSpec(dtype: Dtype, observer_cls: type[ObserverBase] | None = None, is_dynamic: bool | None = None, qscheme: QSchemeType | None = None, ch_axis: int | None = None, group_size: int | None = None, symmetric: bool | None = None, round_method: RoundType | None = None, scale_type: ScaleType | None = None, scale_format: str | None = None, scale_calculation_mode: str | None = None, qat_spec: QATSpec | None = None, mx_element_dtype: Dtype | None = None, zero_point_type: ZeroPointType | None = ZeroPointType.int32, is_scale_quant: bool = False)[source]#
A data class that defines the specifications for quantizing tensors within a model.
- Parameters:
dtype (Dtype) – The data type for quantization (e.g., int8, int4).
is_dynamic (Optional[bool]) – Specifies whether dynamic or static quantization should be used. Default is None, which indicates no specification.
observer_cls (Optional[Type[ObserverBase]]) – The class of observer to be used for determining quantization parameters like min/max values. Default is None.
qscheme (Optional[QSchemeType]) – The quantization scheme to use, such as per_tensor, per_channel or per_group. Default is None.
ch_axis (Optional[int]) – The channel axis for per-channel quantization. Default is None.
group_size (Optional[int]) – The size of the group for per-group quantization, also the block size for MX datatypes. Default is None.
symmetric (Optional[bool]) – Indicates if the quantization should be symmetric around zero. If True, quantization is symmetric. If None, it defers to a higher-level or global setting. Default is None.
round_method (Optional[RoundType]) – The rounding method during quantization, such as half_even. If None, it defers to a higher-level or default method. Default is None.
scale_type (Optional[ScaleType]) – The scale type to be used for quantization, like power of two or float. If None, it defers to a higher-level setting or uses a default method. Default is None.
mx_element_dtype (Optional[Dtype]) – The data type to be used for the element type when using MX datatypes; the shared scale effectively uses FP8 E8M0.
is_scale_quant (bool) – Indicates whether this spec is for quantizing scales rather than tensors. Default is False.
Example:
from quark.torch.quantization.config.type import Dtype, ScaleType, RoundType, QSchemeType
from quark.torch.quantization.config.config import QuantizationSpec
from quark.torch.quantization.observer.observer import PerChannelMinMaxObserver

quantization_spec = QuantizationSpec(
    dtype=Dtype.int8,
    qscheme=QSchemeType.per_channel,
    observer_cls=PerChannelMinMaxObserver,
    symmetric=True,
    scale_type=ScaleType.float,
    round_method=RoundType.half_even,
    is_dynamic=False,
    ch_axis=1,
)
- class quark.torch.quantization.config.config.TQTSpec(threshold_init_meth: TQTThresholdInitMeth | None = None)[source]#
Configuration for the Trained Quantization Thresholds (TQT) quantization-aware training method, implementing https://arxiv.org/abs/1903.08066.
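Example (a sketch; that TQTSpec is accepted by the qat_spec field of QuantizationSpec is an assumption based on the signatures above):
from quark.torch.quantization.config.config import QuantizationSpec, TQTSpec
from quark.torch.quantization.config.type import Dtype, QSchemeType

# Assumption: TQTSpec implements the QATSpec interface expected by qat_spec.
quantization_spec = QuantizationSpec(
    dtype=Dtype.int8,
    qscheme=QSchemeType.per_tensor,
    symmetric=True,
    is_dynamic=False,
    qat_spec=TQTSpec(),  # threshold_init_meth left at its None default
)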
- quark.torch.quantization.config.config.load_pre_optimization_config_from_file(file_path: str) PreQuantOptConfig[source]#
Load pre-optimization configuration from a JSON file.
- Parameters:
file_path (str) – The path to the JSON file containing the pre-optimization configuration.
- Returns:
The pre-optimization configuration.
- Return type:
PreQuantOptConfig
- quark.torch.quantization.config.config.load_quant_algo_config_from_file(file_path: str) AlgoConfig[source]#
Load quantization algorithm configuration from a JSON file.
- Parameters:
file_path (str) – The path to the JSON file containing the quantization algorithm configuration.
- Returns:
The quantization algorithm configuration.
- Return type:
AlgoConfig
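Example (the JSON file paths are placeholders):
from quark.torch.quantization.config.config import (
    load_pre_optimization_config_from_file,
    load_quant_algo_config_from_file,
)

# Point these at your own configuration files.
pre_opt_config = load_pre_optimization_config_from_file("pre_opt_config.json")
algo_config = load_quant_algo_config_from_file("algo_config.json")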
- class quark.torch.quantization.config.config.SmoothQuantConfig(name: str = 'smooth', alpha: float = 1, scale_clamp_min: float = 0.001, scaling_layers: list[dict[str, Any]] = [], model_decoder_layers: str = '')[source]#
A data class that defines the specifications for Smooth Quantization.
- Parameters:
name (str) – The name of the configuration, typically used to identify different quantization settings. Default is "smooth".
alpha (float) – The factor of adjustment in the quantization formula, influencing how aggressively weights are quantized. Default is 1.
scale_clamp_min (float) – The minimum scaling factor to be used during quantization, preventing the scale from becoming too small. Default is 1e-3.
scaling_layers (List[Dict[str, Any]]) – Specific settings for scaling layers, allowing customization of quantization parameters for different layers within the model. Default is [].
model_decoder_layers (str) – Specifies any particular decoder layers in the model that might have unique quantization requirements. Default is "".
The parameter scaling_layers can be left as an empty list (the default), in which case the scaling layers will be automatically detected.
Example:
from quark.torch.quantization.config.config import SmoothQuantConfig

scaling_layers = [
    {
        "prev_op": "input_layernorm",
        "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "inp": "self_attn.q_proj",
        "module2inspect": "self_attn"
    },
    {
        "prev_op": "post_attention_layernorm",
        "layers": ["mlp.gate_proj", "mlp.up_proj"],
        "inp": "mlp.gate_proj",
        "module2inspect": "mlp"
    }
]

smoothquant_config = SmoothQuantConfig(
    scaling_layers=scaling_layers,
    model_decoder_layers="model.layers"
)
- class quark.torch.quantization.config.config.RotationConfig(model_decoder_layers: str, scaling_layers: dict[str, list[dict[str, Any]]], name: str = 'rotation', random: bool = False)[source]#
A data class that defines the specifications for rotation settings in processing algorithms.
- Parameters:
name (str) – The name of the configuration, typically used to identify different rotation settings. Default is "rotation".
random (bool) – A boolean flag indicating whether the rotation matrix should be random. Default is False.
scaling_layers (List[Dict[str, Any]]) – Specific settings for scaling layers, allowing customization of quantization parameters for different layers within the model.
model_decoder_layers (str) – A string indicating the path to the list of decoder layers.
Example:
from quark.torch.quantization.config.config import RotationConfig

scaling_layers = [
    {
        "prev_op": "input_layernorm",
        "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "inp": "self_attn.q_proj",
        "module2inspect": "self_attn"
    },
    {
        "prev_op": "post_attention_layernorm",
        "layers": ["mlp.gate_proj", "mlp.up_proj"],
        "inp": "mlp.gate_proj",
        "module2inspect": "mlp"
    }
]

rotation_config = RotationConfig(
    scaling_layers=scaling_layers,
    model_decoder_layers="model.layers"
)
- class quark.torch.quantization.config.config.QuaRotConfig(scaling_layers: dict[str, list[dict[str, Any]]], name: str = 'quarot', r1: bool = True, r2: bool = True, r3: bool = True, r4: bool = True, rotation_size: bool | None = None, random_r1: bool = False, random_r2: bool = False, optimized_rotation_path: str | None = None, backbone: str = 'model', model_decoder_layers: str = 'model.layers', v_proj: str = 'self_attn.v_proj', o_proj: str = 'self_attn.o_proj', self_attn: str = 'self_attn', mlp: str = 'mlp')[source]#
A data class that defines the specifications for the QuaRot algorithm.
- Parameters:
name (str) – The name of the configuration, typically used to identify different rotation settings. Default is "quarot".
r1 (bool) – Whether to apply the R1 rotation. See the SpinQuant paper for details. Defaults to True.
r2 (bool) – Whether to apply the R2 rotation. See the SpinQuant paper for details. Defaults to True.
r3 (bool) – Whether to apply the R3 rotation. It is only useful when using KV cache quantization. See the SpinQuant paper for details. Defaults to True.
r4 (bool) – Whether to apply the R4 rotation. See the SpinQuant paper for details. Defaults to True.
rotation_size (Optional[int]) – The size of the rotations to apply on activations/weights. By default, the activation last dimension (e.g., hidden_size) or the weight input/output channel dimension is used as the rotation size. If rotation_size is specified, smaller rotations of size (rotation_size, rotation_size) are applied per block. Defaults to None.
random_r1 (bool) – A boolean flag indicating whether R1 should be a random Hadamard matrix. See the SpinQuant paper for details. Default is False.
random_r2 (bool) – A boolean flag indicating whether R2 should be a random Hadamard matrix. See the SpinQuant paper for details. Default is False. random_r1 and random_r2 are only relevant when Hadamard rotations are used for R1 and R2; if optimized_rotation_path is specified, the R1 and R2 matrices are loaded from a file instead of using Hadamard matrices.
scaling_layers (List[Dict[str, str]]) – Specific settings for scaling layers, allowing customization of quantization parameters for different layers within the model.
optimized_rotation_path (Optional[str]) – The path to the file 'R.bin' containing optimized R1 (per model) and R2 (per decoder) matrices. If specified, the R1 and R2 rotations are loaded from this file; otherwise they are Hadamard matrices.
backbone (str) – A string indicating the path to the model backbone.
model_decoder_layers (str) – A string indicating the path to the list of decoder layers.
v_proj (str) – A string indicating the path to the v projection layer, starting from the decoder layer it is in.
o_proj (str) – A string indicating the path to the o projection layer, starting from the decoder layer it is in.
self_attn (str) – A string indicating the path to the self attention block, starting from the decoder layer it is in.
mlp (str) – A string indicating the path to the multilayer perceptron block, starting from the decoder layer it is in.
Example:
from quark.torch.quantization.config.config import QuaRotConfig

quarot_config = QuaRotConfig(
    model_decoder_layers="model.layers",
    v_proj="self_attn.v_proj",
    o_proj="self_attn.o_proj",
    self_attn="self_attn",
    mlp="mlp"
)
- class quark.torch.quantization.config.config.AutoSmoothQuantConfig(name: str = 'autosmoothquant', scaling_layers: list[dict[str, Any]] | None = None, model_decoder_layers: str | None = None, compute_scale_loss: str | None = 'MSE')[source]#
A data class that defines the specifications for AutoSmoothQuant.
- Parameters:
name (str) – The name of the quantization configuration. Default is "autosmoothquant".
scaling_layers (Optional[List[Dict[str, Any]]]) – Configuration details for scaling layers within the model, specifying custom scaling parameters per layer. Default is None.
compute_scale_loss (str) – The loss used to select the best scale, either "MSE" or "MAE". Default is "MSE".
model_decoder_layers (Optional[str]) – Specifies the layers involved in model decoding that may require different quantization parameters. Default is None.
Example:
from quark.torch.quantization.config.config import AutoSmoothQuantConfig

scaling_layers = [
    {
        "prev_op": "input_layernorm",
        "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "inp": "self_attn.q_proj",
        "module2inspect": "self_attn"
    },
    {
        "prev_op": "self_attn.v_proj",
        "layers": ["self_attn.o_proj"],
        "inp": "self_attn.o_proj"
    },
    {
        "prev_op": "post_attention_layernorm",
        "layers": ["mlp.gate_proj", "mlp.up_proj"],
        "inp": "mlp.gate_proj",
        "module2inspect": "mlp"
    },
    {
        "prev_op": "mlp.up_proj",
        "layers": ["mlp.down_proj"],
        "inp": "mlp.down_proj"
    }
]

autosmoothquant_config = AutoSmoothQuantConfig(
    model_decoder_layers="model.layers",
    scaling_layers=scaling_layers
)
- class quark.torch.quantization.config.config.AWQConfig(name: str = 'awq', scaling_layers: list[dict[str, Any]] = [], model_decoder_layers: str = '')[source]#
Configuration for Activation-aware Weight Quantization (AWQ).
- Parameters:
name (str) – The name of the quantization configuration. Default is "awq".
scaling_layers (List[Dict[str, Any]]) – Configuration details for scaling layers within the model, specifying custom scaling parameters per layer. Default is [].
model_decoder_layers (str) – Specifies the layers involved in model decoding that may require different quantization parameters. Default is "".
The parameter scaling_layers can be left as an empty list (the default), in which case the scaling layers will be automatically detected.
Example:
from quark.torch.quantization.config.config import AWQConfig

scaling_layers = [
    {
        "prev_op": "input_layernorm",
        "layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "inp": "self_attn.q_proj",
        "module2inspect": "self_attn"
    },
    {
        "prev_op": "post_attention_layernorm",
        "layers": ["mlp.gate_proj", "mlp.up_proj"],
        "inp": "mlp.gate_proj",
        "module2inspect": "mlp"
    },
]

awq_config = AWQConfig(
    model_decoder_layers="model.layers",
    scaling_layers=scaling_layers
)
- class quark.torch.quantization.config.config.GPTQConfig(name: str = 'gptq', block_size: int = 128, damp_percent: float = 0.01, desc_act: bool = True, static_groups: bool = True, inside_layer_modules: list[str] = [], model_decoder_layers: str = '')[source]#
A data class that defines the specifications for Accurate Post-Training Quantization for Generative Pre-trained Transformers (GPTQ).
- Parameters:
name (str) – The configuration name. Default is "gptq".
block_size (int) – GPTQ divides the columns into blocks of size block_size and quantizes each block separately. Default is 128.
damp_percent (float) – The percentage used to dampen the quantization effect, aiding in the maintenance of accuracy post-quantization. Default is 0.01.
desc_act (bool) – Indicates whether descending activation order is used, typically to enhance model performance with quantization. Default is True.
static_groups (bool) – Specifies whether the order of groups for quantization is static or can be dynamically adjusted. Default is True. Quark export only supports static_groups=True.
inside_layer_modules (List[str]) – Lists the names of internal layer modules within the model that require specific quantization handling. Default is [].
model_decoder_layers (str) – Specifies custom settings for quantization on specific decoder layers of the model. Default is "".
Example:
from quark.torch.quantization.config.config import GPTQConfig

gptq_config = GPTQConfig(
    inside_layer_modules=[
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.q_proj",
        "self_attn.o_proj",
        "mlp.up_proj",
        "mlp.gate_proj",
        "mlp.down_proj"
    ],
    model_decoder_layers="model.layers"
)
- class quark.torch.quantization.config.config.QronosConfig(inside_layer_modules: list[str], model_decoder_layers: str, name: str = 'qronos', block_size: int = 128, desc_act: bool = True, static_groups: bool = True, alpha: float = 0.001, beta: float = 10000.0)[source]#
Configuration for Qronos, an advanced post-training quantization algorithm. Implemented as proposed in https://arxiv.org/pdf/2505.11695
- Parameters:
inside_layer_modules (List[str]) – Lists the names of internal layer modules within the model that require specific quantization handling.
model_decoder_layers (str) – Specifies custom settings for quantization on specific decoder layers of the model.
name (str) – The configuration name. Default is "qronos".
block_size (int) – Qronos divides the columns into blocks of size block_size and quantizes each block separately. Default is 128.
desc_act (bool) – Indicates whether descending activation order is used, typically to enhance model performance with quantization. Default is True.
static_groups (bool) – Specifies whether the order of groups for quantization is static or can be dynamically adjusted. Default is True. Quark export only supports static_groups=True.
alpha (float) – Dampening factor for numerical stability during matrix inversions. Default is 0.001.
beta (float) – Stabilisation factor for the Cholesky decomposition. Default is 1e4.
Example:
from quark.torch.quantization.config.config import QronosConfig

qronos_config = QronosConfig(
    inside_layer_modules=[
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.q_proj",
        "self_attn.o_proj",
        "mlp.up_proj",
        "mlp.gate_proj",
        "mlp.down_proj"
    ],
    model_decoder_layers="model.layers"
)