10 min read

Arxiv Dives - Llama-2 Explained

Arxiv Dives - Llama-2 Explained

Every Friday at Oxen.ai we host a public paper club called "Arxiv Dives" to make us smarter Oxen 🐂 🧠. These are the notes from the group session for reference. If you would like to join the discussion live, sign up here.

The following are the notes from the live session. Feel free to follow along with the video for the full context.

Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Published: July 19th 2023

Key Takeaways

Meta released a suite of language models ranging from 7 billion parameters to 70 billion parameters.

They released these models to compete with the closed source ChatGPTs of the world, to enable the community to build on their work, and contribute to the responsible development of LLMs.

The majority of the work was on optimizing the chat models for Safety and Helpfulness.

There are 6 models total, ranging in size. For each size there is a base model (no fine tuning) and a chat model (fine tuned to be an assistant)

As you can see as of October 13th, 2023, some of them have been downloaded over 1 million times.

Since the release, there has been an explosion of fine tuned models based on these base models in the open source community.


LLMs have shown great promise as highly capable AI assistants that excel in task from programming to creative writing.

“Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data” = Neural Networks that predict the next word in a sequence on a large set of text.

They claim that none of the previous open source pretrained LLMs have been substitutes for closed “product” LLMs.

The steps that it take to fine tune and align LLMs with human preferences can be very costly and are often not transparent or reproducible, limiting progress in AI alignment research.

When they talk about alignment, they mean Safety and Helpfulness, which we will dive into some examples of from the paper.

Updates from Llama-1

  1. Increase the size of the pretraining corpus by 40%
  2. Doubled context length
  3. Grouped query attention
  4. Fine tuned version for chat

Overall Process, very similar to the InstructGPT paper we went over already, just we actually get to play with these models.


They took a few measures to improve the performance during pretraining:

  1. More robust data cleaning
  2. Updated data mixes or distributions
  3. Trained on more tokens
  4. Doubled the context length
  5. Used grouped query attention (GQA)

2 Trillion tokens was a good “performance-cost” trade off for them.

They upsampled the most factual sources in an effort to increase knowledge and dampen hallucinations. This means they literally just duplicated sentences from respected sources to let the model see them over and over again.

The biggest architectural changes are context length and grouped-query attention.

They performed some ablation tests on context length:

Increase Context Length allows the model to process information.

This is useful for supporting longer chat histories. For example, under the hood, chat applications may simply be a long prompt concatenated history. Each message that comes in, the prompt gets longer.

Prompt 1:

You are a large language model called AI …

Human: Hi
AI: Hello

Prompt 2:

You are a large language model called AI …

Human: Hi
AI: Hello
Human: What is the weather like in LA during the winter?

Prompt 3:

You are a large language model called AI …

Human: Hi
AI: Hello
Human: What is the weather like in LA during the winter?
AI: 70 and Sunny as it is all year long!

This obviously gets expensive token wise, and is an active area of research of how to better improve.

I’ve seen some work on using Memory Networks or RAG to selectively choose which documents to put in context or summarize the context before prepending. Does anyone in the audience know of any methods that improves on this?

Grouped-Query Attention

For larger models, the attention computation can get quite expensive. This mechanism helps speed up the computation, there’s a whole paper on it if you want to learn more.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi…

The main idea behind Grouped-Query Attention is that since larger context windows are more computationally expensive (O(n^2)) you need a mechanism to speed it up if you have a larger context window.

I interpreted the ablation study results as since it improves the speed for a large context, and doesn’t affect performance that much, they keep it.

Training Hardware

This is pretty insane.

More info on the research cluster: https://ai.meta.com/blog/ai-rsc/

Some fun stats

  • 3.3 million GPU hours on a cluster of 16,000 NVIDIA A100s (which each have 80GB of VRAM)
  • You can't even rent a single A100 from LambdaLabs, but if you wanted a machine with 8 of them, it would be $12 an hour, so back of the envelope math is 3,311,616*12/8 = $5 million dollars and 3,311,616/8/24/365 = 47 years to run on an imaginary Lambda machine of 8 GPUs
  • Facebook has high speed network connections between all 16,000 of their GPUs to make this even possible

They also calculate the carbon footprint of training and say they offset the emission cost with their sustainability program.

Note: this is just for pre-training of the base models, fine tuning is less expensive, because there is less data. I have fine tuned Llama-2 7B overnight on my laptop GPU (16GB Nvidia Card).

Pre-Trained Model Evaluation

  • Code - HumanEval dataset
  • Commonsense Reasoning - PIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, CommonSenseQA datasets
  • World Knowledge - NaturalQuestions, TriviaQA
  • Reading Comprehension - SQuAD
  • MATH - GSM8K and MATH

Depending on the dataset they use different N-shot examples in the prompt.

An example of 3-shot prompting on question answering might be:

This is an exam with a list of trivia questions and answers. Fill in the following with the correct answers.

Q: What is the capital of the United States?
A: Washington DC

Q: Who invented the Segway?
A: Dean Kamen

Q: What is the fastest land animal?
A: Cheetah

Q: {question}

Compared to other Open-Source base models:

Compare to Closed-Source models

Fine Tuning

They spend several months of research iterating and fine tuning these models.

Overall, the fine tuning is a very similar process to the InstructGPT paper, which we have a full video and post on.


They first do a Supervised Fine Tuning (SFT) then Reinforcement learning from Human Feedback (RLHF)

A reminder of the InstructGPT flow:

Supervised Fine Tuning

Getting started they used publicly available datasets for instruction fine tuning. Not sure the exact datasets, maybe it’s in the appendix.

They quickly found out that “Quality is all you need”. In other words, 3rd party fine tuning data is available from many sources, but they have insufficient diversity and quality. Especially for dialogue style instructions.

They set aside “millions of examples from third party datasets” and focused first on collecting several thousand high-quality examples from their own annotation efforts, and results significantly improved.

They stopped annotating at 27,540 examples for their dataset.

As far as I can tell, they did not release this dataset, as it is probably one of the main secret sauces, and took a lot of effort. Kind of disappointing because there is a lot of great work here, but the devil is in the details.

Giving the model weights is a great jumping off point to continue to fine tune your own models, but not giving the data does not make it very reproducible.


In this case they have human annotators select which of two model outputs they prefer. This data is then used to train a reward model, that learns how the humans rank the outputs, and then can automate preference decisions.

They went with a binary decision over the A > B > C > D that we saw in instruct GPT paper, because they say it maximized the diversity of the collected prompts.

For llama-2 they decided to train two separate reward models, one for helpfulness and one for safety. This is because others have found that helpfulness and safety can be trade offs at times.

It’s really hard to draw the line between what is a sensitive keyword vs what is a benign question.

Iterative Fine-Tuning

They do not just fine tune once, but they continued to receive more and more labeled data, so they were able to train successive versions of the RLHF models.

We saw this technique also used during the segment anything paper, where continuing to train and use the previously trained model as a bootstrap can improve performance.

You can see the performance increase over time as they iterate back and forth on the data collection and retraining.


They did an analysis of the pre-training data to shed light on root causes of potential downstream issues and biases.

The main areas of the pretraining data they analyze are distributions of

  • Languages
  • Demographic representations
  • Toxicity+

Safety Fine-Tuning

What does safety mean in this context? There are a few “prompt attack vectors” they define.

  1. Illicit and criminal activities
    1. Terrorism, theft, human trafficking
  2. Hateful and harmful activities
    1. Defamation, self-harm, eating disorders, discrimination
  3. Unqualified Advice
    1. Medical advice, financial advice, legal advice)

Context Distillation - prompt the LLM with context that it should be safe, get the response, then strip the prompt and incorporate this into the training data.

They show that adding more safety data improves the Reward Model safety score, while keeping helpfulness relatively stable.


A big takeaway from the research was that reinforcement learning was an essential step to foster “synergy” between humans and LLMs through the annotation process.

Humans are typically better at comparing two responses than writing them from scratch. The analogy here is that “while we may not all be accomplished artists, our ability to appreciate and critique art remains intact.”

Supervised data in the form of humans writing down responses, may no longer be the gold standard. You either need the best humans, or train the LLM to become better than us and just rely on humans for judgement.

Fun "Temporal Perception" Examples

Tool Usage


There are still a lot of open questions about bias, toxicity, private data leakage, and potential for malicious uses of these LLMs.

Overall this work I think is important to give to the open source community to get more eyes on the problem. With closed source models it centralizes the power and decision making, while open source leads to more transparency and gives more people a crack at solving some of the problems.

That being said, there are obvious risk factors, and as with all technology can be used for good and bad. I think cyber crime and cryptography are good examples of the battle back and forth of attacker and defender. In general, open source code that everyone can contribute to and use tends to be more battle tested than a closed source implementation that only a few eyes are on.

I personally believe that not only open sourcing the model, but having attribution and open source data is a big step towards making these models as safe and helpful as possible. If we can audit the data that went into each model, we can better understand it’s behavior, and make more informed decisions going forward.

If you enjoyed this dive, please join us next week!

Arxiv Dives with Oxen.ai · Luma
Hey Nerd, join the Herd!... for a little book/paper review. Make sure to also join our Discord here (https://discord.gg/s3tBEn7Ptg) to share recommendations for future reads and more…

All the past dives can be found on the blog.

Arxiv Dives - Oxen.ai
Each week we dive deep into a topic in machine learning, data management, or general artificial intelligence research. These are notes from a live reading group we do every Friday. Captured for future reference.

The live sessions are posted on YouTube if you want to watch at your own leisure.

Each week we dive deep into a topic in machine learning or general artificial intelligence research. The sessions are live with a group of smart Oxen every Friday. Join the discussion: https://lu.ma/oxenbookclub

Best & Moo,

~ The herd at Oxen.ai

Who is Oxen.ai?

Oxen.ai is an open source project aimed at solving some of the challenges with iterating on and curating machine learning datasets. At its core Oxen is a lightning fast data version control tool optimized for large unstructured datasets. We are currently working on collaboration workflows to enable the high quality, curated public and private data repositories to advance the field of AI, while keeping all the data accessible and auditable.

If you would like to learn more, star us on GitHub or head to Oxen.ai and create an account.