19  Working with Large Language Models (LLMs)

Published

March 26, 2026

Keywords

LLMs, Prompt Engineering, Tokenization

19.1 Working with Large Language Models

19.1.1 Learning Outcomes

By the end of this section, you should be able to:

  • Develop and employ prompts for chat-based or code-based interactions with LLMs to get effective results.
  • Describe how Large Language Models (LLMs) generate responses.
  • Understand tokenization and token limits, and how model size affects outputs.
  • Explain the structure of Agents and how they differ from simple prompting.
  • Employ R to connect LLMs to tools and build simple agent loops.

19.1.2 References

Other References for additional exploration.

19.2 A Student’s Guide to Prompt Engineering

Note

This section was written based on a conversation with ChatGPT that started with a 156 word prompt and about 25 follow-up prompts to adjust, expand and refine content. Additional editing improved clarity and consistency as well as formatting and adjustments to code and code chunk options. References were checked and adjusted for accuracy and citations were added. Links to inline references were added.Any errors are my responsibility.

19.2.1 Learning Outcomes

By the end of this guide, you should be able to:

  • Explain what prompt engineering is and why it matters when working with AI tools.
  • Describe how Large Language Models (LLMs) generate responses.
  • Understand tokenization and token limits, and how model size affects outputs.
  • Construct effective first prompts for coding and data science questions.
  • Use follow-up prompts to refine, extend, or debug an AI’s response.
  • Critically evaluate responses to detect errors or hallucinations.

19.2.2 References

Other References for additional exploration.

19.3 Prompt Engineering

“Prompt engineering is the process of writing effective instructions for a model, such that it consistently generates content that meets your requirements.” @OpenAI (2025)

A “prompt” can be a question, a request for code, a set of instructions, or even an ongoing conversation with a Large Language Model (LLM).

  • Creating or “engineering” a prompt (or series of prompts) to produce the most accurate, useful, and relevant output possible to meet your goals is still a mix of art and science.

Prompt engineering builds on an understanding of how LLMs work (not how they are trained) to create prompts that are effective for your purposes. Characteristics of LLMs can shape effective prompts.

  • Non-determinism: LLM responses vary because they are generated from probabilities in a high-dimensional space. Small wording changes can produce very different outputs.
    • Asking what is the “most positive” versus the “least negative” sentiment of text.
  • Context Sensitivity: LLMs don’t “understand” in the human sense; they generate responses based on patterns in data. How you ask a question strongly influences the quality of the answer.
    • Specifying “Give me R code” versus “Explain in plain English” leads to very different outputs.

You can apply guidelines and best practices to improve your chances of getting useful results consistently while building your understanding and skills.

  • Efficiency: A well-crafted prompt reduces the need for repeated clarifications.
  • Accuracy: Clear context, guidance, and constraints can help minimize errors or hallucinations.
  • Improved Collaboration: prompt engineering can be seen as refining or debugging your question prior to collaborating with others or the LLM.
  • Multi-language skill: The same techniques apply whether you’re generating R code, Python code, documentation, or explanations.

In short: Prompt engineering is about learning how to “talk to the model” effectively so it can become a productive tool for your goals rather than a source of confusion.

19.4 How LLMs Respond to Prompts

LLMs (like ChatGPT, Claude, or Gemini) do not “understand” like humans. Instead, they:

  1. Predict the next token (word or subword) given your input and context.
  2. Use patterns from training data to approximate reasoning.
  3. Are sensitive to framing: wording, order, specificity, and constraints change the output.
  4. Can hallucinate: generate confident-sounding but false statements.

19.4.1 Tokenization

LLMs do not read raw text directly. Instead, they break text into tokens (smaller units such as words, subwords, or characters).

  • Tokens are not always whole words.
    • Example: "data science"["data", " science"] (2 tokens)
    • Example: "statistics101"["statistics", "101"] (2 tokens)
    • Example: "internationalization"["international", "ization"] (2 tokens)
  • This is similar to but not the same as Lemmatization which reduces words to their base form (common in NLP preprocessing, not in LLM tokenization).
    • "running""run"
    • "better""good"

Here is a toy example of token splitting.

                  Word                  Tokens
1                 data                    data
2        statistics101        statistics | 101
3 internationalization international | ization
4              running              run | ning
5               better                  better

19.4.2 Token Limits and Conversation Context

  • Each model has a maximum token limit (prompt + response). Examples:
    • Small models: 2k–4k tokens
    • Medium models: 16k–32k tokens
    • Large models: up to 200k tokens in advanced systems
  • Context accumulation:
    • Most LLMs add all previous prompts and responses in a conversation back into each new prompt as context for the new prompts.
    • The model itself does not maintain an explicit memory of your prior prompts and its responses so adding them in this way “simulates” a conversation while providing more context for the next response.
    • Over time, long conversations with lots of code in prompts and responses can grow very large and approach the token limit.
    • If the limit is exceeded, some older context may be truncated, which can lead to loss of information or incorrect responses, or you will be forced to start a new conversation and all context will be lost.
Tip

Best Practices to avoid exceeding your token budget.

  • If your prompt is long (e.g., pasting a large dataset), summarize it, sample it, or provide only the key features.

  • Start a new conversation for a completely new topic. This prevents context from growing too large and ensures the model doesn’t mix unrelated topics.

19.4.3 Tokens are Converted to Numbers

  • LLMs do not operate on tokens as text. Each token is mapped to a numerical vector (an embedding).
  • Embeddings are high-dimensional vector representations of tokens or text that capture semantic meaning.
  • LLM embeddings are vectors with hundreds or thousands of dimensions, not just 2 or 3, and each dimension encodes some aspect of meaning, context, or syntactic/semantic feature.
    • Different models can use different embedding schemes.
  • The model performs mathematical operations on these vectors to predict the next token based on mathematical similarity or “closeness.”

Example:
- "dog"[0.12, -0.03, 0.88, ...]
- "puppy"[0.14, -0.01, 0.91, ...]
- "car"[0.80, 0.20, -0.05, ...]

19.4.4 Measuring Closeness

  • Similarity between the embedding vectors is measured using metrics like cosine similarity.

\[ \text{cosine similarity} = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|} \]

  • Cosine similarity = 1 -> vectors point in the same direction (highly similar meaning).

  • Cosine similarity = 0 ->vectors are orthogonal (unrelated meaning).

  • Embeddings allow LLMs to “understand” similar words or phrases as they have embeddings “close” together in vector space.

  • These metrics are used for:

    • Semantic search (finding relevant text)
    • Retrieval-Augmented Generation (RAG)
    • Clustering or similarity calculations

19.4.5 Visualizing Embeddings in R

This example shows how words with similar meaning might cluster together in the embedding space.

  • “dog”, “puppy”, and “cat” cluster closely, reflecting semantic similarity.
  • “car” and “bicycle” are farther away, showing unrelated meaning.
  • In real embeddings, vectors are high-dimensional, but this 2D example illustrates the concept.
  • Cosine similarity measures closeness mathematically in high dimensions, even though they can’t be plotted.

Warning

Important nuance:

  • These similarity measures are inherently fuzzy.
  • Minor variations in your prompt—such as negating a sentence, reordering words, or changing context—can result in large differences in the output, even if the overall meaning seems similar to a human reader.

Example:
- Prompt 1: "List three common R functions for plotting a histogram."
- Prompt 2: "List three uncommon R functions for plotting a histogram." - Prompt 3: "List three popular R functions for plotting a histogram."

Even though only one word changes, the model may:

  • Suggest completely different functions.
  • Reorder examples differently.
  • Include/exclude certain packages.

This happens because embeddings map text to points in a continuous space, and the model predicts outputs based on small differences in those positions.

Takeaway: Always experiment with multiple phrasings and verify results. Treat embeddings and similarity measures as guides, not exact truth.

19.5 Guidelines for Effective Prompts

Good prompts share a few common traits: they are clear, contextual, and iterative. Here are strategies to improve your interactions with AI tools:

  • Be Specific
    Clearly describe what you want the AI to do. Include the programming language, the type of output, and the level of detail you expect.
    Example:
    • Weak: “Plot the data.”
    • Strong: “Write R code using ggplot2 to create a line plot of revenue by year with labeled axes and a descriptive title.”
  • Give Context
    Provide background information such as the data structure, libraries, or your end goal. This reduces ambiguity and helps the AI tailor its response.
    Example:
    • “I have a pandas data frame with columns city and population. Please write Python code using seaborn to create a bar chart of population by city.”
  • State Constraints
    Specify limitations or requirements for the response, such as the format, length, or assumptions.
    Example:
    • “Give me only the R code, no explanations.”
    • “Limit the answer to a single ggplot2 figure.”
    • “Assume the data frame has no missing values.”
  • Iterate
    Think of prompting as a conversation, not a one-shot request. Start simple, review the output, and refine with follow-up prompts.
    Example:
    • First prompt: “Write Python code to read a CSV file and display the first five rows.”
    • Follow-up: “Now extend this code to calculate the mean of all numeric columns.”
    • Next follow-up: “Format the summary as a neat table.”
  • Verify
    Never assume the AI is correct. Check the output against your own knowledge, official documentation, or by running the code. Be alert for hallucinations (nonexistent functions, incorrect syntax, or misleading explanations).
    Example:
    • If the AI suggests robust_cor() in R, search the documentation. If it doesn’t exist, redirect:
      “That function doesn’t exist. Could you instead use Spearman’s correlation or show me how to fit a robust regression with MASS::rlm()?”
  • Adjust for Creativity or Accuracy
    You can control how wide-ranging or precise the response should be by adjusting your wording.
    Example:
    • Creative: “Show three different ways in R to visualize a distribution.”
    • Accurate: “Show the single most standard ggplot2 approach for plotting a histogram of a numeric variable.”
  • Assign a Role
    Guide the style of the response by telling the AI who it should act as.
    Example:
    • “You are a data science tutor. Explain correlation to a beginner and include an R code example.”
    • “You are a coding assistant. Provide concise Python code with no explanations.”

19.5.1 Example: Making a Good First Prompt

Weak prompt:
> Plot the data.

Improved prompt:
> I have a data frame in R with columns year and revenue. Please write R code using ggplot2 to create a line plot of revenue by year, with labeled axes and a title.

19.5.2 Example: Building a Conversation

First Prompt:
> Write Python code to read a CSV and summarize the first five rows.

LLM Output:
Code using pandas.read_csv() and df.head().

Follow-Up Prompt:
> Please extend your code to also compute the mean of all numeric columns and print the result.

LLM Output:
Adds df.mean(numeric_only=True).

Next Follow-Up:
> Could you format the summary as a table with column means below the head output?

The model refines until the solution fits your needs.

19.5.3 Example: Checking for Hallucinations

Prompt:
> Write R code to compute a robust correlation coefficient.

LLM Output:
Provides code with a function robust_cor() (a hallucination as the function does not exist).

Student Check: - Get’s an error message and looks up if robust_cor() exists in R.
- If not, ask:
> That function doesn’t seem to exist. Could you instead show how to use MASS::rlm() or another real package to compute robust correlation?

The key is to verify and redirect.

19.5.4 Example: Large vs. Small Model Responses

The same prompt can yield different results depending on the size of the model.

Larger models (billions of parameters) generally produce more detailed and accurate answers, while smaller ones may be faster but less reliable.

Prompt:
> In R, how do I compute the correlation between two variables when the data has outliers?

Model Response
Large model (e.g., GPT-4, Claude Opus, Llama-70B) “One option is to use a robust correlation method. For example, you can use Spearman’s rank correlation in R: cor(x, y, method = "spearman"). Another approach is to fit a robust regression using MASS::rlm() if you want to downweight outliers. Both approaches reduce the influence of extreme values compared to Pearson correlation.”
Smaller model (e.g., GPT-3.5, Llama-7B) “You can use cor(x, y) in R. This computes correlation between two vectors.” (Note: does not mention outliers or alternatives like Spearman or robust regression.)

Takeaway:
- The large model recognizes the nuance (outliers) and suggests multiple valid approaches.
- The smaller model gives a quick but incomplete response.
- Lesson: Always check whether the model has considered your context.

19.5.5 Example: Specifying Roles

When prompting, you can assign a role to the AI which will adjust its response.

Prompt 1 (role: assistant):
> You are a coding assistant. Write R code to plot the distribution of a numeric variable.

Likely Response:
Straightforward R code using hist() or ggplot2::geom_histogram().

Prompt 2 (role: instructor):
> You are a statistics instructor. Explain to a beginner how to plot the distribution of a numeric variable in R, and include an example using ggplot2.

Likely Response:
A step-by-step explanation with annotated code — more teaching-oriented.

19.5.6 Example: Adjusting Creativity vs. Accuracy

LLMs can be “dialed” for creativity (diverse answers, new ideas) or accuracy (precise, deterministic answers).

  • This is often controlled by a setting called temperature (higher = more creative, lower = more predictable).
  • Even without changing system settings, you can influence style through your prompt wording.

Prompt (creative mode):
> Be imaginative. Show me three different R approaches to visualize the distribution of a variable.

Possible Output:
- Histogram (geom_histogram())
- Density plot (geom_density())
- Boxplot (geom_boxplot())

Prompt (accuracy mode):
> Provide the single most standard way in R using ggplot2 to visualize a variable’s distribution.

Possible Output:
One clean example using geom_histogram(), without alternatives.

Below we show a “standard” accurate plot of a distribution: If the AI had been asked in creative mode, it might instead show a density plot, violin plot, or boxplot.

library(ggplot2)
set.seed(42)

values <- rnorm(200, mean = 50, sd = 10)

ggplot(data.frame(values), aes(x = values)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Values (Accurate Mode)",
       x = "Value", y = "Count") +
  theme_minimal()

You can shape the AI’s “persona”(assistant, instructor, critic, tutor) and also control breadth vs. precision of answers depending on their goals.

19.6 Summary

Here is a quick reference sheet you can use when working with AI tools:

Strategy What It Means Example Prompt
Specify Role Tell the AI who it should act as (assistant, instructor, critic, tutor). “You are a statistics instructor. Explain correlation to a beginner with examples in R.”
Give Context Provide background: dataset, libraries, goals. “I have a data frame in R with columns year and revenue. Use ggplot2 to plot revenue by year.”
State Constraints Limit length, format, or assumptions. “Give me Python code only, no explanations, using pandas and seaborn.”
Iterate Use follow-up prompts to refine or extend. “Now add labels to the axes.”
Verify Check outputs against your knowledge or documentation. “That function doesn’t exist in R. Show me an alternative from MASS or robustbase.”
Creativity vs. Accuracy Ask for one best method (accuracy) or multiple diverse methods (creativity). “Show three different ways in R to visualize a distribution.”
Check for Hallucination Be skeptical if the AI invents code/functions. Redirect if necessary. “I can’t find that function. Can you cite the package or suggest a real function?”

Prompt engineering is not about tricking the AI, but about effective communication.

Think of the AI as a partner:

  • You provide structure, clarity, and verification.
  • It provides suggestions, alternatives, and explanations.

With practice, you’ll learn when to ask for creativity, when to demand precision, and how to iterate toward a reliable solution.

19.7 Strengthening Prompt Engineering From Chat to Agent Workflows”

This short module is about transitioning from chat-based to code-based interactions with LLMs

It focuses on strengthening thinking about prompts in three areas that will be useful for using agents:

  • Structured prompt design
  • Iteration as a loop
  • Constraint prioritization

19.7.1 A Simple Prompt Template

The following template provides a consistent way to construct high-quality prompts.

A good prompt answers five questions:

  1. ROLE: Who should the AI act as?
  2. TASK: What exactly should it do?
  3. CONTEXT: What information does it need?
  4. CONSTRAINTS: What rules must it follow?
  5. OUTPUT FORMAT: What should the result look like?

Example:

  1. Role: You are a data science tutor.
  2. Task: Write R code to visualize revenue over time.
  3. Context: I have a data frame with columns year and revenue.
  4. Constraints:
    • Use ggplot2
    • Assume no missing values
    • Include axis labels and a title
  5. Output: Code only with no explanation

A tightly structured prompt has the following benefits:

  • Reduces ambiguity
  • Improves consistency which supports reproducibility
  • Makes debugging prompts easier
  • Provides a foundation for agent workflows

19.7.2 Iteration as a Loop

In chat-based interactions, iterating prompts can be thought of as a conversation.

However, when working with code-based interactions, it is more useful to think of iteration as a structured loop.

Iteration Pattern follows a common cycle

  1. Ask
  2. Evaluate
  3. Adjust
  4. Repeat

Example

  1. Ask: Write R code to plot revenue by year.
  2. Evaluate: Check for missing labels or no title
  3. Adjust: Add axis labels and a descriptive title.
  4. Repeat: Continue refining until the output meets requirements

This loop is the foundation of more advanced systems:

  • Prompting → manual loop
  • Agents → automated loop (plan → act → evaluate)

19.7.3 Constraint Hierarchy

Not all constraints are equally important. Focus on the critical few instead of the “messy many.”

Suggested: Priority Order

  1. **Output Format*
    • code vs explanation
    • table vs paragraph
  2. Tools / Libraries
    • ggplot2, pandas, etc.
  3. Assumptions
    • missing values
    • data types
  4. Scope
    • one solution vs multiple options
  5. Style
    • concise vs detailed

Example of weak versus strong:

  • Weak constraint: Make it clear and nice.
  • Strong constraint: Output only R code using ggplot2 with labeled axes and a title.

Example: Poor vs Structured Iteration

  • Poor Iteration: User repeatedly adds vague follow-ups such as “Make it better”, “Fix it” or “Change it a bit”
    • Results can be inconsistent outputs, drifting logic, confusion in tracking the logical flow.
  • Improved Approach:
    • Reset with structure:
      • You are a coding assistant.
      • Task: Create a ggplot line chart.
      • Context: Data frame with year and revenue.
      • Constraints: Use ggplot2, include labels and title
      • Output: Code only
    • Then iterate systematically using the loop.

19.7.4 Transition to Agents

These ideas extend directly into working with agents more effectively.

In prompt engineering, you structure instructions; in agent systems, you structure processes.

The same components apply:

  • clear tasks
  • explicit constraints
  • iterative refinement

The difference is that agents can:

  • use tools
  • execute actions
  • repeat the loop automatically

Prompt engineering is not just about wording.

It is about building a repeatable structure for interacting with AI systems.

This structure becomes the foundation for working with agents.

19.8 From Prompting to Agents in R”

This chapter is designed to follow a chapter on prompt engineering.

Students should already understand:

  • how to write clear prompts
  • how to specify constraints
  • how to iterate on responses

This chapter answers the next question:

How do we move from prompts to agents that can act?

20 Learning Objectives

By the end of this chapter, students should be able to:

  1. Explain what an AI agent is (beyond prompting)
  2. Describe the components of an agent system
  3. Use R to connect an LLM to tools
  4. Build a simple agent loop
  5. Critically evaluate agent outputs

21 Key Concept: From Prompt → Agent

In the previous chapter, you used prompts like:

  • “Summarize this dataset”
  • “Write R code to compute a mean”

Those are static interactions.

An agent adds:

  • tools
  • iteration
  • execution

21.1 Conceptual shift

Prompting Agents
Ask for output Achieve a goal
One step Multi-step
No execution Can act via tools

22 What is an Agent?

An agent is typically composed of:

  1. A model (LLM)
  2. Instructions (system prompt)
  3. Tools (functions it can call)
  4. State (conversation or memory)
  5. A loop (plan → act → evaluate)

This chapter focuses on making those pieces visible in R.


23 Setup

# install.packages("ellmer")
# install.packages("jsonlite")
# install.packages("tibble")
# install.packages("dplyr")

library(ellmer)
library(jsonlite)
library(tibble)
library(dplyr)

24 Step 1 — Baseline (Not an Agent)

chat <- chat_openai()
chat$chat("What is the mean mpg in mtcars?")

24.1 Discussion

  • The model answers
  • But it does not access real data
  • It may hallucinate

25 Step 2 — Add Tools (Core Idea)

25.1 Tool: Inspect dataset

inspect_data <- function(data_name = "mtcars") {
  df <- get(data_name, envir = .GlobalEnv)
  tibble(
    column = names(df),
    class = vapply(df, function(x) paste(class(x), collapse = ", "), character(1))
  ) |> toJSON(pretty = TRUE, auto_unbox = TRUE)
}

25.2 Tool: Compute mean

mean_numeric_column <- function(data_name = "mtcars", column_name) {
  df <- get(data_name, envir = .GlobalEnv)

  if (!column_name %in% names(df)) {
    return(toJSON(list(error = "Column not found"), auto_unbox = TRUE))
  }

  result <- mean(df[[column_name]], na.rm = TRUE)

  toJSON(list(mean = result), auto_unbox = TRUE)
}

26 Step 3 — Tool-Enabled Agent

agent <- chat_openai(
  system_prompt = "You are an R assistant. Use tools when needed.",
  tools = list(
    inspect_data = inspect_data,
    mean_numeric_column = mean_numeric_column
  )
)
agent$chat("Find the mean mpg in mtcars.")

26.1 Key Idea

Now the model can:

  • decide what to do
  • call a function
  • use real data

27 Step 4 — Make the Agent Loop Explicit

We now force the structure:

agent$chat(
  paste(
    "Follow these steps:",
    "1. Plan",
    "2. Inspect data",
    "3. Compute",
    "4. Explain",
    "Question: What is the mean mpg?"
  )
)

27.1 Discussion

This is the core agent loop:

  • plan
  • act
  • evaluate

28 Step 5 — Manual Loop (Critical for Understanding)

planner <- chat_openai()

plan <- planner$chat("Create a 3-step plan to compute mean mpg from mtcars.")
plan
mean(mtcars$mpg)
[1] 20.09062
planner$chat("Interpret the value for a student.")

28.1 Key Insight

You just simulated an agent manually.


29 Student Exercise

29.1 Task

Build a mini agent that can:

  1. Identify numeric columns
  2. Compute mean for a chosen column
  3. Explain the result

29.2 Requirements

  • Define at least one new tool
  • Use structured prompting
  • Show the plan before execution

30 Instructor Notes (Do Not Skip)

30.1 What students struggle with

  1. They assume the model “knows” the data
  2. They skip validation
  3. They do not constrain tool use

30.2 What to emphasize

  • tools provide truth
  • prompts provide structure
  • loops provide robustness

30.3 Key teaching moment

Ask:

“What happens if the tool is wrong?”


31 Comparison to Prior Chapter

Prompt Engineering Agents
Write better instructions Build systems
Focus on wording Focus on workflow
Output quality Process quality

32 Environment Recommendation

32.1 Use RStudio when:

  • minimizing setup time
  • reinforcing existing workflows

32.2 Use Positron when:

  • demonstrating modern AI workflows
  • comparing chat vs inline vs agent

32.3 Suggested approach

  • Teach this chapter in RStudio
  • Show a short demo in Positron afterward

33 Extension (Optional)

Students can extend the agent to:

  • compute medians
  • run regressions
  • generate plots

34 Final Takeaway

An agent is not just a better prompt.

It is a system where:

  • a model makes decisions
  • tools execute actions
  • results feed back into reasoning

Understanding this structure is more important than any specific tool.