Aug 09, 2023

Fine-tune OpenAI's Whisper Automatic Speech Recognition (ASR) model

Written By:

Goran Katalinic

Whisper – the open source automatic speech recognition (ASR) model created by OpenAI – is incredibly powerful out of the box. 

It is trained on 680,000 hours of labelled audio data, 117,000 hours of which cover 96 languages other than English, meaning that it can be applied to a wide range of applications with great results. 

The vanilla version of Whisper is available to run for inference in a Paperspace Gradient Notebook, powered by Graphcore IPUs.

There are also good reasons to fine-tune Whisper for a particular use case. This could include accounting for the complex and sometimes subtle differences in speech and vocabulary as influenced by: 

  • A less common spoken language
  • Locale and dialect
  • A particular domain, such as scientific, medical, and legal

Where can I get audio data for fine-tuning Whisper?

Some organisations may have large amounts of proprietary audio data that can be used in the fine-tuning process. For others, gathering the audio necessary for fine-tuning is not a trivial undertaking.

Thankfully, there are several open-sourced speech recognition datasets available, covering multiple languages. The largest of these are:

There are smaller datasets covering many more languages and dialects, such as:

  • VoxPopuli: 1,800 hours, 16 languages
  • Fleurs: 12 hours per language, 102 languages
  • There are also individual datasets hosted by OpenSLR

In our Paperspace Gradient Notebook, we demonstrate fine-tuning using the Catalan subset of OpenSLR. 

How to fine-tune Whisper on Graphcore IPUs

Get started by running the Whisper Small Fine Tuning notebook on Paperspace.

For each code block below, you can simply click to run the block in Paperspace, modifying the code or parameters where relevant. We explain how to run the process in environments other than Paperspace Gradient Notebooks at the end of this blog.

Install dependencies

# Install optimum-graphcore from source
!pip install git+https://github.com/huggingface/optimum-graphcore.git@v0.7.1 "soundfile" "librosa" "evaluate" "jiwer"
%pip install "graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools"
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/whisper"

# Generic imports
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import evaluate
import numpy as np
import torch
from datasets import load_dataset, Audio, Dataset, DatasetDict

# IPU-specific imports
from optimum.graphcore import (
    IPUConfig,
    IPUSeq2SeqTrainer,
    IPUSeq2SeqTrainingArguments,
)
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from transformers import WhisperForConditionalGeneration

Load dataset

The code below loads the Catalan subset of OpenSLR (SLR69) mentioned earlier. 🤗 Datasets enables us to easily download the data and prepare the training and evaluation splits.

Common Voice datasets, which consist of recordings of speakers reading text from Wikipedia in different languages, are an alternative source. To use Common Voice instead, first ensure you have accepted the terms of use on the 🤗 Hub: mozilla-foundation/common_voice_13_0. Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.

dataset = DatasetDict()

split_dataset = Dataset.train_test_split(
    load_dataset("openslr", "SLR69", split="train", token=False), test_size=0.2, seed=0
)
dataset["train"] = split_dataset["train"]
dataset["eval"] = split_dataset["test"]
print(dataset)

The columns of interest are:

  • audio: the raw audio samples
  • sentence: the corresponding ground truth transcription.

We drop the path column.

dataset = dataset.remove_columns(["path"])

Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the audio samples in our dataset are resampled to that rate.

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Prepare Dataset

We prepare the datasets by extracting features from the raw audio inputs and injecting labels which are simply transcriptions with some basic processing.

The feature extraction is provided by 🤗 Transformers WhisperFeatureExtractor. To decode generated tokens into text after running the model, we will similarly require a tokenizer, WhisperTokenizer. Both of these are wrapped by an instance of WhisperProcessor.

MODEL_NAME = "openai/whisper-small"
LANGUAGE = "spanish"
TASK = "transcribe"
MAX_LENGTH = 224

processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.max_length = MAX_LENGTH
processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)

def prepare_dataset(batch, processor):
    inputs = processor.feature_extractor(
        raw_speech=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    batch["input_features"] = inputs.input_features[0].astype(np.float16)

    transcription = batch["sentence"]
    batch["labels"] = processor.tokenizer(text=transcription).input_ids
    return batch

columns_to_remove = dataset.column_names["train"]
dataset = dataset.map(
    lambda elem: prepare_dataset(elem, processor),
    remove_columns=columns_to_remove,
    num_proc=1,
)

train_dataset = dataset["train"]
eval_dataset = dataset["eval"]
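As a rough illustration of what the feature extractor produces for each sample, the sketch below works out the shape of one log-mel input from Whisper's published preprocessing parameters: 16 kHz audio padded or truncated to 30 seconds, a 10 ms hop between frames, and 80 mel bins. The helper `whisper_feature_shape` is hypothetical and is not part of 🤗 Transformers; it only reproduces the arithmetic.

```python
# Illustrative arithmetic only; `whisper_feature_shape` is a hypothetical helper,
# not a 🤗 Transformers or optimum-graphcore function.
SAMPLING_RATE = 16_000  # Hz, the rate Whisper was pre-trained on
CHUNK_SECONDS = 30      # each audio clip is padded or truncated to this length
HOP_LENGTH = 160        # samples between consecutive frames (10 ms at 16 kHz)
N_MELS = 80             # number of mel bins used by whisper-small


def whisper_feature_shape(sampling_rate=SAMPLING_RATE, chunk_seconds=CHUNK_SECONDS,
                          hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded audio chunk."""
    n_samples = sampling_rate * chunk_seconds  # 480,000 samples for 30 s
    n_frames = n_samples // hop_length         # one frame every hop_length samples
    return (n_mels, n_frames)


print(whisper_feature_shape())  # (80, 3000)
```

Because every clip is padded or truncated to the same length, each `input_features` entry has the same static shape, regardless of how long the original recording was.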

Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. This padding is to ensure tensors of static shape are provided to the model. We do this on the fly via the data collator below.

@dataclass
class DataCollatorSpeechSeq2SeqWithLabelProcessing:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = {}
        batch["input_features"] = torch.tensor([feature["input_features"] for feature in features])

        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt", padding="longest", pad_to_multiple_of=MAX_LENGTH)
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels
        return batch
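To make the effect of the label padding concrete, here is a minimal pure-Python sketch of the same idea on toy data. It is not the collator itself: `pad_labels` is a hypothetical helper, and the token IDs are made up. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.

```python
# A toy sketch of label padding; `pad_labels` is a hypothetical helper and the
# token IDs below are invented for illustration.
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_labels(label_lists, max_length, ignore_index=IGNORE_INDEX):
    """Right-pad every label sequence to `max_length` with `ignore_index`."""
    padded = []
    for labels in label_lists:
        if len(labels) > max_length:
            raise ValueError("label sequence is longer than max_length")
        n_pad = max_length - len(labels)
        padded.append(list(labels) + [ignore_index] * n_pad)
    return padded


toy_labels = [[5, 6, 7], [8, 9]]
print(pad_labels(toy_labels, max_length=4))
# [[5, 6, 7, -100], [8, 9, -100, -100]]
```

Note that the collator above masks positions using the attention mask rather than by comparing against the pad token ID. Since the pad token was set to the end-of-sequence token, this appears to keep any genuine end-of-sequence label from being masked out along with the padding.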

Define metrics

The performance of our fine-tuned model will be evaluated using word error rate (WER).

metric = evaluate.load("wer")

def compute_metrics(pred, tokenizer):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]
    normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)

    return {"wer": wer, "normalized_wer": normalized_wer}
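For readers unfamiliar with WER, the following is a minimal sketch of the metric for a single reference/hypothesis pair: the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words. The function `word_error_rate` is a hypothetical helper written for illustration. The `evaluate` metric used above aggregates over the whole evaluation set, so its result is not simply the average of per-sentence values.

```python
# A single-pair WER sketch; `word_error_rate` is a hypothetical helper, not the
# implementation used by the `evaluate` library.
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substituted word out of two reference words.
print(100 * word_error_rate("hola mundo", "hola mondo"))  # 50.0
```

Text normalisation matters here: a hypothesis that differs from the reference only in casing or punctuation counts as an error in the raw WER, which is why the code above also reports a normalised WER.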

Load pre-trained model

model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

model.config.max_length = MAX_LENGTH
model.generation_config.max_length = MAX_LENGTH

Ensure language-appropriate tokens, if any, are set for generation. We set them on both the config and the generation_config to ensure they are used correctly during generation.

model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.generation_config.suppress_tokens = []

Fine-tuning Whisper on the IPU

The model can be directly fine-tuned on the IPU using the IPUSeq2SeqTrainer class.

The IPUConfig object specifies how the model will be pipelined across the IPUs.

For fine-tuning, we place the encoder on two IPUs, and the decoder on two IPUs.

For inference, the encoder is placed on one IPU, and the decoder on a different IPU.

replication_factor = n_ipu // 4
ipu_config = IPUConfig.from_dict(
    {
        "optimizer_state_offchip": True,
        "recompute_checkpoint_every_layer": True,
        "enable_half_partials": True,
        "executable_cache_dir": executable_cache_dir,
        "gradient_accumulation_steps": 16,
        "replication_factor": replication_factor,
        "layers_per_ipu": [5, 7, 5, 7],
        "matmul_proportion": [0.2, 0.2, 0.6, 0.6],
        "projection_serialization_factor": 5,
        "inference_replication_factor": 1,
        "inference_layers_per_ipu": [12, 12],
        "inference_parallelize_kwargs": {
            "use_cache": True,
            "use_encoder_output_buffer": True,
            "on_device_generation_steps": 16,
        }
    }
)
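One way to read the `layers_per_ipu` and `inference_layers_per_ipu` values is sketched below. It assumes that layers are assigned to IPUs consecutively, with the 12 encoder layers of Whisper Small first and its 12 decoder layers after them, which matches the placement described above. The helper `assign_layers` is hypothetical and only illustrates the bookkeeping; it is not how optimum-graphcore implements pipelining.

```python
# Illustration only; `assign_layers` is a hypothetical helper. The assumption is
# that layers are placed consecutively, encoder layers first.
ENCODER_LAYERS = 12  # whisper-small
DECODER_LAYERS = 12  # whisper-small


def assign_layers(layers_per_ipu):
    """Return a list where entry i is the IPU index holding layer i."""
    mapping = []
    for ipu_index, n_layers in enumerate(layers_per_ipu):
        mapping.extend([ipu_index] * n_layers)
    return mapping


def ipus_used(mapping):
    """Split a layer-to-IPU mapping into the IPUs used by encoder and decoder."""
    encoder = sorted(set(mapping[:ENCODER_LAYERS]))
    decoder = sorted(set(mapping[ENCODER_LAYERS:]))
    return encoder, decoder


training_map = assign_layers([5, 7, 5, 7])
inference_map = assign_layers([12, 12])

print(ipus_used(training_map))   # ([0, 1], [2, 3])
print(ipus_used(inference_map))  # ([0], [1])
```

Under this reading, both lists must sum to 24, the total number of transformer layers in the model.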

Lastly, we specify the arguments controlling the training process.

total_steps = 1000 // replication_factor
training_args = IPUSeq2SeqTrainingArguments(
    output_dir="./whisper-small-ipu-checkpoints",
    do_train=True,
    do_eval=True,
    predict_with_generate=True,
    learning_rate=1e-5 * replication_factor,
    warmup_steps=total_steps // 4,
    evaluation_strategy="steps",
    eval_steps=total_steps,
    max_steps=total_steps,
    save_strategy="steps",
    save_steps=total_steps,
    logging_steps=25,
    dataloader_num_workers=16,
    dataloader_drop_last=True,
)
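The arguments above scale with the number of replicas: more replicas mean fewer steps and a proportionally larger learning rate. The sketch below reproduces that arithmetic for different IPU counts. The function `training_schedule` and its parameter names are hypothetical; the constants (4 IPUs per replica, 1000 base steps, a base learning rate of 1e-5, warm-up over a quarter of the steps) are taken from the code above.

```python
# Reproduces the scaling arithmetic from the training arguments above;
# `training_schedule` is a hypothetical helper.
def training_schedule(n_ipu, ipus_per_replica=4, base_steps=1000, base_lr=1e-5):
    """Derive the step counts and learning rate for a given number of IPUs."""
    replication_factor = n_ipu // ipus_per_replica
    if replication_factor < 1:
        raise ValueError("at least one full replica (4 IPUs) is required")
    total_steps = base_steps // replication_factor
    return {
        "replication_factor": replication_factor,
        "max_steps": total_steps,
        "warmup_steps": total_steps // 4,
        "learning_rate": base_lr * replication_factor,
    }


for n_ipu in (4, 16):
    print(n_ipu, training_schedule(n_ipu))
```

With 4 IPUs this gives one replica and 1000 steps; with 16 IPUs it gives four replicas, 250 steps, 62 warm-up steps and a learning rate four times the base value.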

Then, we just need to pass all of this together with our datasets to the IPUSeq2SeqTrainer class:

trainer = IPUSeq2SeqTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),
    compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),
    tokenizer=processor.feature_extractor,
)

To gauge the improvement in WER, we run an evaluation step before fine-tuning.


All that remains is to fine-tune the model! The fine-tuning process should take between 6 and 18 minutes, depending on how many replicas are used, and achieve a final WER of around 10%.


Fine-tuning on IPUs in non-Paperspace environments

To run the Whisper Small fine-tuning demo using IPU hardware other than in a Paperspace Gradient Notebook, you need to have the Poplar SDK enabled.

Refer to the Getting Started guide for your system for details on how to enable the Poplar SDK. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.


In this notebook, we demonstrated how to fine-tune Whisper for multi-lingual speech recognition and transcription on the IPU.

We used a single replica on a total of four IPUs. Reducing the fine-tuning time requires more than one replica, and hence more IPUs. On Paperspace, you can use either an IPU Pod16 or a Bow Pod16, both with 16 IPUs. Please contact Graphcore if you need assistance running on larger platforms.

For all available notebooks, check IPU-powered Jupyter Notebooks to see how IPUs perform on other tasks.

Have a question? Please contact us on our Graphcore community channel.