Small language models, high performance: DeBERTa and the Future of NLU

Large Language Models (LLMs) such as OpenAI’s GPTs, Google’s Bard, and Meta’s LLaMA have popularised the concept of Natural Language Processing in AI and are now finding a growing number of commercial uses.

While much attention has been focused on the generative capabilities of such models, many NLP applications require Natural Language Understanding (NLU), rather than generation.

NLU is used in chatbots and virtual assistants, enabling them to understand user queries and navigate conversation flow. It also plays a critical role in search engines, where it helps to retrieve relevant information based on user queries.

The healthcare industry is increasingly deploying NLU to extract information from patient records and assist doctors in making more accurate diagnoses.

Perhaps because certain high-profile LLMs have demonstrated broad capabilities, some users are turning to them for NLU applications, but this may prove to be computational overkill.

In this article, we'll explore how smaller models such as Microsoft’s DeBERTa can achieve surprising performance on NLU tasks.

Beyond text-based Interfaces

The usefulness of NLP-powered systems has advanced hugely in recent years. However, primarily text-based interfaces such as chatbots and virtual assistants have limitations:

Communicating by text alone can be challenging when dealing with complex information, such as medical diagnoses or financial advice, which may require visual aids such as diagrams, images, graphs, or maps.

Conveying emotion and tone through text is also difficult and can lead to misunderstandings or misinterpretations, particularly in customer service applications.

Finally, there is the issue of cognitive overload, which occurs when users are presented with too much text at once, leading to confusion and frustration.

To address these problems, NLP applications can incorporate other forms of media, such as images, graphs, and maps, into their UI/UX design.

NLU models play a critical role in this process by creating the structured data formats required for these designs.

For example, a weather app could use a chatbot interface that also incorporates graphs and maps to convey information more effectively, with NLU models extracting relevant information from user input and converting it into a structured format.
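As a rough illustration of the weather app idea, the sketch below turns a free-text query into a small dictionary that a map or graph component could consume. It assumes an extractive question-answering model is used to fill each field. The slot names, the questions, the `min_score` threshold, and the checkpoint name `deepset/deberta-v3-base-squad2` are all illustrative assumptions rather than anything prescribed here, and the loader assumes the Hugging Face `transformers` package is installed.

```python
from typing import Callable, Dict, Optional

# A QA function takes (question, context) and returns a dict with
# at least the keys "answer" (str) and "score" (float).
QAFn = Callable[[str, str], Dict[str, object]]

# Hypothetical slots for a weather app, each filled by asking one question.
SLOT_QUESTIONS = {
    "location": "Which place is the user asking about?",
    "date": "Which day or time is the user asking about?",
}


def extract_slots(
    user_text: str, qa: QAFn, min_score: float = 0.3
) -> Dict[str, Optional[str]]:
    """Turn free text into a flat dict that the UI layer can render from."""
    slots: Dict[str, Optional[str]] = {}
    for name, question in SLOT_QUESTIONS.items():
        result = qa(question, user_text)
        # Keep an answer only if the model is reasonably confident in it.
        confident = float(result["score"]) >= min_score
        slots[name] = str(result["answer"]).strip() if confident else None
    return slots


def load_deberta_qa(model_name: str = "deepset/deberta-v3-base-squad2") -> QAFn:
    """Wrap a Hugging Face question-answering pipeline as a QAFn.

    The checkpoint name is an assumed example; any extractive QA model
    should work in its place.
    """
    # Imported lazily so extract_slots can be used without transformers.
    from transformers import pipeline

    qa_pipeline = pipeline("question-answering", model=model_name)
    return lambda question, context: qa_pipeline(question=question, context=context)
```

Calling `extract_slots("Will it rain in Bristol tomorrow?", load_deberta_qa())` returns a dictionary with `location` and `date` keys. Any slot the model is unsure about is left as `None`, so the interface can ask a follow-up question instead of drawing the wrong map.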

Cost-Efficiency of Smaller Models

Large, complex LLMs like GPT-3/4 and T5 aren't always the most efficient choice for these sorts of tasks. While the simplicity of setting them up can be seductive, they are often computationally expensive, which translates directly into financial cost.

Using smaller models like DeBERTa can lead to significant savings while maintaining high levels of accuracy. In many cases, these smaller models can even outperform larger models on specific tasks.

Because smaller models require less computational power to train and use, they can be faster and more accessible. The smaller size of these models also allows them to be deployed on smaller devices, making them ideal for edge computing and other resource-constrained environments.

DeBERTa

One of the most popular Natural Language Understanding architectures is DeBERTa, a transformer-based model that achieves state-of-the-art results in a variety of NLU tasks, including question answering, natural language inference, and sentiment analysis.

DeBERTa is a more efficient variant of the popular language model BERT, specifically designed for Natural Language Understanding tasks. It addresses some of BERT's limitations, such as the inability to model long-range dependencies and the lack of robustness to noisy text.

DeBERTa outperforms BERT across the board and exceeds the NLU performance of the majority of larger and more recent language models.

One reason for DeBERTa's success is its novel architecture, which improves attention across the input sequence through disentangled attention: each token's content and its relative position are encoded as separate vectors, and attention weights are computed from both. This helps DeBERTa achieve high accuracy with fewer parameters.
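For readers who want the detail, the DeBERTa paper ("DeBERTa: Decoding-enhanced BERT with Disentangled Attention", He et al.) writes the attention score between tokens i and j as the sum of three terms: content-to-content, content-to-position, and position-to-content. The formula below is a transcription of that definition, not a new derivation. Q^c, K^c and V_c are projections of the content vectors, Q^r and K^r are projections of a shared relative position embedding, δ(i, j) is the clipped relative distance from token i to token j, and d is the hidden dimension.

```latex
\tilde{A}_{i,j}
  = Q^{c}_{i}\,{K^{c}_{j}}^{\top}
  + Q^{c}_{i}\,{K^{r}_{\delta(i,j)}}^{\top}
  + K^{c}_{j}\,{Q^{r}_{\delta(j,i)}}^{\top},
\qquad
H_{o} = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V_{c}
```

The scaling factor is √(3d) rather than the usual √d because three terms are summed instead of one.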

It is believed that on many NLU tasks, such as SQuAD, bidirectional encoders of the kind adopted by DeBERTa considerably outperform the left-to-right decoders used in the GPT models [1].

On benchmark datasets such as SuperGLUE, DeBERTa also outperforms larger, more complex models such as GPT-3 and T5, while using a fraction of the number of parameters.
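To put "a fraction of the number of parameters" into rough numbers, the back-of-envelope sketch below estimates how much memory is needed just to hold each model's weights. The parameter counts are the published figures for the largest variant of each model (175 billion for GPT-3, 11 billion for T5, and 1.5 billion for the DeBERTa model used on SuperGLUE). The 2 bytes per parameter figure is an assumption corresponding to 16-bit weights, and the estimate ignores activations and any other runtime overhead.

```python
# Published parameter counts for the largest variant of each model.
PARAM_COUNTS = {
    "GPT-3": 175_000_000_000,
    "T5": 11_000_000_000,
    "DeBERTa": 1_500_000_000,
}


def weights_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit weights; use 4 for 32-bit.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, num_params in PARAM_COUNTS.items():
        print(f"{name:8s} {weights_gb(num_params):7.1f} GB")
```

Under these assumptions GPT-3's weights occupy 350 GB, T5's 22 GB, and DeBERTa's 3 GB. That is a difference of more than a hundred times between the largest and smallest models, before any inference workload is taken into account.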

You can try out DeBERTa-Base inference for yourself for free by launching the Paperspace Gradient Notebook, powered by Graphcore IPUs.

[1] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension – Mike Lewis et al.
