September 18, 2023

Making LLMs Smarter—Training, Fine-Tuning, and Embedding

We’re living in an era where machines can write like humans thanks to large language models (LLMs) like OpenAI’s ChatGPT. These AI models are powering everything from customer service chatbots to instructional design assistants, and the hype around them is growing by the day.

But have you ever stopped to wonder how these AI models come to “know” so much? How do they master the art of conversation, and most importantly, how come they are getting smarter and smarter? To answer these questions, we need to explore the key components of LLM training and optimization.

It All Starts With Data—Lots and Lots of Data

The training of LLMs is a fascinating process that commences with an extensive and diverse dataset. This dataset can be derived from various sources like books, websites, articles, and open datasets. Platforms like Google Dataset Search, Data.gov, or Wikipedia often serve as popular public sources for this data.

For example, GPT-2, a predecessor to the models behind ChatGPT, was trained primarily on WebText, a dataset of around 8 million web pages collected from the internet.

Once the dataset is gathered, it’s not quite ready to be used—like a diamond, it needs to be polished first. This involves cleaning and prepping the data, which can include converting the dataset to lowercase, removing stop words, and breaking down the text into smaller pieces, called tokens, which are essentially the building blocks of language.
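
To make the cleaning steps above concrete, here is a minimal Python sketch that lowercases a sentence, splits it into word-level tokens, and drops stop words. The `STOP_WORDS` set and the `preprocess` function are illustrative names made up for this example, and the word-level splitting is a simplification: production LLMs generally rely on learned subword tokenizers instead.

```python
import re

# Tiny illustrative stop-word list (an assumption for this example);
# real pipelines use much larger lists or skip this step entirely.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The model is trained to predict the next token."))
```

The result is a short list of tokens that later steps can count, index, or feed to a model.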

Training LLMs to Turn Data into Dialogue

Remember those tokens we mentioned earlier? In essence, LLMs like ChatGPT are trained to predict the next token in a sentence. This training involves a huge number of internal parameters, also referred to as weights, that define the strength of connections between the artificial neurons that make up the model. 
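
The idea of next-token prediction can be sketched with a toy that simply counts which token follows which. This is only an illustration of the prediction task: an LLM learns billions of weights in a neural network rather than keeping a lookup table, and the names `train_bigram` and `predict_next` are hypothetical helpers written for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```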

According to OpenAI, GPT-3, released in 2020, has 175 billion of these parameters. During the training process, parameters are adjusted in an iterative fashion to minimize the loss function. 

In a nutshell, the loss function is a measure of how far off the model’s predictions are from the actual results. Every time the model makes a prediction during training, it computes the loss function to quantify its mistakes.

Think of it as the score on a dartboard; the bullseye is the true, correct result. Each throw of the dart represents a prediction the model makes, and the loss function is equivalent to the distance from the dart to the bullseye. The goal is to get as close to the bullseye as possible, and the way to do that is to minimize that distance, or the loss.
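
In code, the distance from the bullseye might look like the sketch below. It assumes a cross-entropy loss, which is a common choice for next-token prediction but is not named in this article, and the probability tables are invented for illustration.

```python
import math


def cross_entropy(predicted_probs, correct_token):
    """Negative log of the probability the model assigned to the correct token."""
    return -math.log(predicted_probs[correct_token])


# Two made-up predictions for the token that follows "the cat sat on the".
confident = {"mat": 0.9, "dog": 0.05, "sky": 0.05}
unsure = {"mat": 0.2, "dog": 0.4, "sky": 0.4}

print(cross_entropy(confident, "mat"))  # small loss: close to the bullseye
print(cross_entropy(unsure, "mat"))     # larger loss: further from the bullseye
```

The more probability the model puts on the correct token, the closer the loss gets to zero.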

The parameters of the model are then adjusted based on this loss. A process known as back-propagation calculates the gradient of the loss function with respect to the parameters, indicating how much a small change in each parameter would affect the loss. An optimization algorithm such as gradient descent then nudges each parameter in the direction that reduces the loss.
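
A stripped-down sketch of this adjustment loop is shown below. It assumes a model with a single parameter `w`, a mean squared error loss, and plain gradient descent, with the gradient worked out by hand. Real LLM training applies the same idea across billions of parameters, where back-propagation computes the gradients automatically.

```python
def loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient(w, data):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def train(w, data, learning_rate=0.05, steps=100):
    """Repeatedly move w a small step against the gradient."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, data)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points generated by y = 2x
w = train(0.0, data)
print(w, loss(w, data))
```

Starting from `w = 0.0`, each step shrinks the loss until `w` settles very close to 2, the value that generated the data.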

Over many iterations—often millions or even billions—this training process gradually improves the model’s ability to predict the next token in a text string, essentially enabling it to generate more accurate, contextually relevant, and human-like text.

Taking LLMs from Good to Great With Fine-Tuning

While the initial training of LLMs involves millions or billions of iterations, refining these models for a specific task or domain requires a more precise technique. This is where fine-tuning comes into play.

Think of fine-tuning as the final adjustment of a grand piano or a precision tool. It’s the process of refining and enhancing a pre-trained LLM to perform specific tasks or cater to a particular domain more effectively. 

For example, fine-tuning can help a model navigate domain-specific jargon and perform a specific task more reliably.

Fine-tuning starts by introducing a task-specific dataset to the pre-trained LLM. Once again, the goal is to adjust the model’s parameters to minimize the loss function. The fine-tuning process can cover several different tasks, such as question-answering, paraphrasing, and reading comprehension.
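
The key point is that fine-tuning starts from the pre-trained parameters rather than from scratch. The toy sketch below illustrates this with a one-parameter model. The datasets, learning rates, and step counts are all invented for the example, and `fit` and `mse` are hypothetical helper names.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def fit(w, data, learning_rate, steps):
    """Gradient descent on y = w * x, starting from the given w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


# "Pre-training": a larger, general dataset that follows y = 2x.
general_data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
# "Fine-tuning": a small task-specific dataset that follows y = 2.5x.
task_data = [(1.0, 2.5), (2.0, 5.0)]

pretrained_w = fit(0.0, general_data, learning_rate=0.02, steps=200)
# Start from the pre-trained weight and take a few small steps on the task data.
finetuned_w = fit(pretrained_w, task_data, learning_rate=0.01, steps=20)

print(mse(pretrained_w, task_data), mse(finetuned_w, task_data))
```

After only a handful of small updates, the fine-tuned weight has a lower loss on the task data than the pre-trained weight did.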

Once fine-tuning is complete, the next step is to evaluate the model on a test dataset that is separate from the data used in training and validation. This critical phase checks that the model is ready to face real-world data and can perform the specific task well.
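
One simple way to keep test data separate is to split the examples once, before any training happens. The sketch below assumes an 80/10/10 split and a fixed random seed. Both are illustrative choices rather than recommendations, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Because the three lists never overlap, a score measured on the test list reflects examples the model has not seen.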

Embedding LLMs With New Knowledge

Embeddings are numerical vectors that carry the semantic essence of a piece of text. If you think of text as a rich tapestry of meaning, embeddings are like its blueprint in a language that machines can understand.
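
To see what it means for a vector to carry meaning, consider the sketch below. It compares hand-made three-dimensional vectors using cosine similarity, a common way to measure how closely two embeddings point in the same direction. The vectors are invented for illustration, since real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors for illustration only, not output from a real model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))
print(cosine_similarity(embeddings["cat"], embeddings["invoice"]))
```

Texts with related meanings end up with vectors that score close to 1, while unrelated texts score much lower.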

Users of OpenAI’s API, for example, can create embeddings through a dedicated endpoint and then use them to give the AI model they’re working with access to information it was not trained on.

These embeddings can be stored in a vector database that serves as a repository for high-dimensional representations of text data. By storing the embeddings, a vector database can provide quick and easy access to this rich, semantic information.

In short, embeddings stored in vector databases play a crucial role in enhancing the capabilities of LLMs by facilitating a more efficient, scalable, and nuanced understanding of text data. 
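
As a rough sketch of what a vector database does, the class below stores text alongside its embedding and returns the stored texts closest to a query vector. `InMemoryVectorStore` is a hypothetical toy written for this example, and the two-dimensional vectors are made up. Real vector databases typically add persistence and approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


class InMemoryVectorStore:
    """Toy stand-in for a vector database: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        """Store a piece of text together with its embedding."""
        self._items.append((text, vector))

    def search(self, query_vector, top_k=1):
        """Return the top_k stored texts most similar to query_vector."""
        scored = [(self._cosine(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)


store = InMemoryVectorStore()
# Made-up vectors; in practice an embedding model would produce them.
store.add("Refunds are accepted within 30 days.", [0.1, 0.9])
store.add("Shipping takes 3 to 5 business days.", [0.9, 0.1])

print(store.search([0.2, 0.8]))
```

An application can pass the retrieved text to an LLM along with the user’s question, giving the model relevant information without retraining it.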


As we delve into the era of advanced AI models, understanding the driving forces behind their increasing intelligence and adaptability becomes crucial. This intelligence is not a product of chance, but a result of meticulous processes like data cleaning, training with large datasets, fine-tuning for specific tasks, and embedding new knowledge through semantic vectors stored in databases. 

These components work in harmony, giving LLMs like ChatGPT the ability to understand and generate human-like text, navigate domain-specific jargon, perform specific tasks, and continuously learn from new information. The future of AI models lies in harnessing these techniques, making them more accurate, adaptable, and intelligent, and thus, more impactful in the applications they serve.