17  Python in R

Published

November 6, 2024

Keywords

python, reticulate, numpy, pandas, matplotlib, seaborn

The code chunks for Python and R have different colors in this chapter.

  • Code chunks for Python use this color.
  • Code chunks for R use this color.

17.1 Introduction

17.1.1 Learning Outcomes

  • Create strategies for integrating Python capabilities into an R workflow.
  • Use Python in RStudio and R.
    • Setup the RStudio environment for Python.
    • Use the Python REPL in RStudio.
    • Use Python Code Chunks and Python scripts in R.
    • Recognize a Few Differences between Coding in R and Python.
  • Employ basic Python and NumPy arrays.
  • Read in and manipulate data with {Pandas}.
  • Visualize data with {Matplotlib} and {Seaborn}.

17.1.2 References:

17.1.2.1 Other References

17.1.3 Posit Philosophy on R and Python

R has its origins as a high-level language that enables users to access capabilities in other languages, e.g., SQL, Spark, C++, and bash.

  • R packages such as {dbplyr}, {sparklyr}, {shiny}, and {Rcpp} provide functions that let users choose the right capabilities for their task while hiding or wrapping the details of the other language.
  • Non-scientific surveys show many data scientists use R and Python with a preference for using R for data visualization and statistical analysis and using Python for large scale data transformation and machine learning.
  • It makes sense to enable R users to incorporate Python capabilities into their work and vice versa.

The {reticulate} package was developed by Posit (formerly RStudio) to enable R users to incorporate Python capabilities into their work.

  • Starting with version 1.2 in 2018, RStudio Desktop has included features to make it easier to blend R and Python in the same project.

If you are working in a pure Python project, RStudio recommends using one of the popular IDEs for Python, as they have more extensive support for a broad variety of Python capabilities. These include:

Note, you can also run some R from inside Python using the {rpy2} package.

If working on a team with both R and Python users, or combining R and Python in the same project, consider using Quarto instead of R Markdown.

17.1.4 Python Modules, Libraries, Packages, and Frameworks

Python uses a modular programming structure like R.

  • Users can import a variety of packages or libraries to gain access to capabilities.
  • However, the terms may be used slightly differently.

We’ll use the terms library and package interchangeably, but to some there is a difference:

  • A library is an umbrella term referring to a reusable chunk of code.
    • Usually, a Python library contains a collection of related modules and packages.
    • Actually, this term is often used interchangeably with “Python package” because packages can also contain modules and other packages (sub-packages).
    • However, it is often assumed that while a package is a collection of modules, a library is a collection of packages.
    • Using R as an example, the {tidyverse} package could be considered a “library” of packages while {ggplot2} could be considered a package focused on a specific set of capabilities.
    • See Python Modules, Packages, Libraries, and Frameworks.

17.2 Setting Up Python for RStudio using the {reticulate} package

If you want to incorporate some Python into your R project, you can use RStudio and the {reticulate} package.

  • This requires installing the {reticulate} package as well as a version of Python usable by {reticulate}.
Warning

Configuring RStudio, {reticulate}, and Python to work together can be tricky, depending upon how many varieties of Python are on your machine and where they are located.

Tip

There are numerous old StackOverflow posts about issues with configuring {reticulate} and Python.

Start with the most recent {reticulate} vignettes in the references to troubleshoot.

The following steps should establish a miniconda Python environment specifically for use with {reticulate}.

  • Use the Console to install the {reticulate} package.
```{r}
#| eval: false
#| message: false
install.packages("reticulate")
```
  • Load the package with library().
```{r}
library(reticulate)
```
  • Check your python configuration.
```{r}
py_config()
```
python:         /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/bin/python
libpython:      /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/libpython3.10.dylib
pythonhome:     /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate:/Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate
version:        3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy:          /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.10/site-packages/numpy
numpy_version:  1.26.4

NOTE: Python version was forced by RETICULATE_PYTHON
  • To install a version of Python usable by {reticulate}, run in the Console Pane:
```{r}
#| eval: false
reticulate::install_miniconda(force = TRUE)
```
  • This will also install several Python packages, including {numpy}.

If you have multiple python environments on your computer, you will probably want to configure a virtual environment for {reticulate} that will not affect your other python environments.

There are several ways to do this in the documentation if you have different versions of python for different projects.

  • We will assume you only need one version for working with {reticulate}, so we will create an R environment variable to tell {reticulate} where miniconda is installed.

  • When you installed miniconda, the output told you the directory where it was installed. The location will vary between Macs and PCs.

  • Scroll back through your console pane for lines like the following (on a Mac):

    • Miniconda has been successfully installed at "~/Library/r-miniconda-arm64". (in red font)
    • [1] "/Users/rressler/Library/r-miniconda-arm64"
  • Copy the second line with the absolute path to the miniconda library to your clipboard.

  • Use the {usethis} package (you may have to also install it using the console) to edit the hidden .Renviron file for your computer.

    • It is usually under the top level user directory, but {usethis} will find it for you.
```{r}
#| eval: false
usethis::edit_r_environ()
```
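  • Enter the following: RETICULATE_PYTHON="paste in the path", then append /envs/r-reticulate/bin/python to the pasted path so it points to the Python executable inside the r-reticulate environment.
  • On a Mac, it should look like RETICULATE_PYTHON="/Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/bin/python"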
  • Save the file and close it.
  • Restart RStudio.

Reload {reticulate}.

```{r}
library(reticulate)
```

Now check your configuration again.

```{r}
py_config()
```
python:         /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/bin/python
libpython:      /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/libpython3.10.dylib
pythonhome:     /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate:/Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate
version:        3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
numpy:          /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.10/site-packages/numpy
numpy_version:  1.26.4

NOTE: Python version was forced by RETICULATE_PYTHON

It should look something like the following (your version numbers may differ), with your r-reticulate version of Python in the top line:

  python:         /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/bin/python
  libpython:      /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/libpython3.8.dylib
  pythonhome:     /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate:/Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate
  version:        3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:13)  [Clang 14.0.6 ]
  numpy:          /Users/rressler/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.8/site-packages/numpy
  numpy_version:  1.24.2
  • To validate Python was installed and is available, run reticulate::py_available() with argument initialize = TRUE.
```{r}
#| eval: false
py_available(initialize = TRUE)
```
  • Note: This may work even if you have other versions of python installed. We will check the configuration again after installing several packages.

To install Python packages, use py_install().

  • For example, we can install the {pandas}, {matplotlib} and {seaborn} packages using the following in the Console pane.
```{r}
#| label: install-packages
#| eval: false
py_install(packages = c("pandas", "matplotlib", "seaborn"))
```

To test that everything is configured run the following.

```{r}
np <- import("numpy")
pd <- import("pandas")
plt <- import("matplotlib.pyplot")
sns <- import("seaborn")
```

If all the packages load without error, your configuration is set to use {reticulate} with python.
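The same check can be made from the Python side, for example in a Python chunk or the REPL. The sketch below uses only the Python standard library to report which interpreter is running and whether each package can be found. The helper name `missing_packages` is made up for illustration and is not part of {reticulate}.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point into the r-reticulate environment.
print("Python executable:", sys.executable)

# An empty list means every package can be imported.
print("Missing packages:",
      missing_packages(["numpy", "pandas", "matplotlib", "seaborn"]))
```

If the printed path is not the one set in RETICULATE_PYTHON, the environment variable was probably not picked up, and restarting RStudio is a reasonable first step.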

17.3 Working with Python in RStudio

17.3.1 The Python REPL Shell

You are used to using the RStudio console as an interactive environment for working in R.

The console can also run an interactive Python shell.

To start a Python shell, run the following in the console:

```{r}
#| eval: false
reticulate::repl_python()
```

This creates a Python “REPL” (“read–eval–print loop”) environment, which is an interactive programming environment.

Figure 17.1: RStudio Console REPL

Exit the REPL by typing exit or quit.

```{python}
#| eval: false
exit
```

17.3.2 Python Code Chunks

You can have Python chunks in Quarto files by replacing the “r” at the beginning of the chunk with “python”.

```{python}
#| label: python-code-chunk

# Code goes here
```

You can access R objects in Python using the r object.

  • That is, r.x will access, in Python, the x variable defined using R.
```{r}
x <- c(1, 4, 6, 2)
```

```{python}
r.x
```
[1.0, 4.0, 6.0, 2.0]

You can access Python objects in R using the py object.

  • That is, py$x will access, in R, the x variable defined using Python.

```{python}
x = [8, 9, 11, 3]
```
```{r}
py$x
```
[1]  8  9 11  3

You can also begin a Python REPL by hitting Control/Command + Enter inside a Python chunk (see Tools/Keyboard Shortcuts Help/Execute).

17.3.3 Python Script Files

Python scripts (where there is only Python code and no plain text) end in “.py”.

You can create a Python script in RStudio using the menu for File/New File/Python Script or the drop-down icon.

Figure 17.2: Starting a new python script using the File menu.
Figure 17.3: Starting a new python script using the dropdown menu.

Create a new script and enter the following two lines:

print("Hello from Python script")

x

Use Run or Control/Command + Enter on each line of the Python script to start a Python REPL (if not already running) and execute the script in the console, as in Figure 17.4.

Figure 17.4: Results of running the new python script.

17.3.4 RStudio Environment Pane

You can switch the RStudio Environment from R to Python to see the different values in each environment.

  • Objects in the Python environment also show their methods.
Figure 17.5: R environment.
Figure 17.6: Python environment.

17.4 Differences between R and Python for the R user

```{r}
library(reticulate)
```

17.4.1 White Space

White space matters in Python.

  • In R, expressions are grouped into a code block with the curly braces operator { }.

In Python, expressions are grouped by making the expressions share an indentation level.

  • For example, an expression with an R code block might be:

if (TRUE) {
  cat("This is one expression.\n")
  cat("This is another expression.\n")
}

  • The equivalent in Python:

if True:
  print("This is one expression.")
  print("This is another expression.")

Python accepts tabs or spaces as the indentation spacer, but the rules get tricky when they’re mixed.

  • Most style guides suggest, and IDEs default to, using only spaces.
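As a small illustration of how indentation alone decides which statements belong to which block, consider the sketch below. The function name `describe` is hypothetical and exists only for this example.

```python
def describe(n):
    labels = []
    if n > 0:
        labels.append("positive")   # runs only when n > 0
        if n % 2 == 0:
            labels.append("even")   # runs only when both conditions hold
    labels.append("done")           # back at the function's level: always runs
    return labels


print(describe(4))
print(describe(3))
print(describe(-2))
```

Moving the `labels.append("done")` line one level to the right would place it inside the `if n > 0:` block and change the result for negative numbers, with no braces involved.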

17.4.2 Container Types

In R, a list (created with list()) is a container you can use to organize R objects.

  • There is no single direct equivalent in Python that supports all the same features.

Instead there are (at least) 4 different Python container types you will use:

  • lists,
  • dictionaries,
  • tuples, and
  • sets.

17.4.3 Lists

Python lists are typically created using bare brackets [ ].

  • The Python built-in list() function is more of a coercion function, closer in spirit to R’s as.list().

Python lists are modified in place, not copied and modified.

  • Syntax for Python lists includes using + and * with lists.
  • These are concatenation and replication operators, akin to R’s c() and rep().
Important

Indexing starts with 0 not 1.
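The points above can be seen in a few lines of plain Python. This is a minimal sketch with made-up values.

```python
x = [1, 2, 3]
y = x                # y is another name for the same list, not a copy

x.append(4)          # modifies the list in place
print(y)             # y sees the change too

print(x + [5, 6])    # concatenation, like c(x, 5, 6) in R
print([0, 1] * 3)    # replication, like rep(c(0, 1), 3) in R

print(x[0])          # first element: indexing starts at 0
print(x[-1])         # last element
print(list("abc"))   # list() coerces another iterable to a list
```

Because assignment does not copy, use `x.copy()` or `list(x)` when an independent list is needed.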

17.4.4 Dictionaries

Dictionaries are most similar to R environments.

They are containers where you can retrieve items by name, though in Python the name (called a key) does not need to be a string as it does in R.

  • It can be any hashable Python object, that is, any object that implements the __hash__() method.
  • Note using r_to_py() converts R named lists to python dictionaries.
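A brief sketch of dictionary use, with made-up keys and values, showing that keys need not be strings but must be hashable:

```python
ages = {"ann": 31, "bob": 27}       # string keys, like names in an R list
ages["cy"] = 45                     # add a new item by key
print(ages["bob"])

# Keys can be any hashable object, such as numbers or tuples.
grid = {(0, 0): "origin", (1, 2): "point A", 3.5: "a float key"}
print(grid[(1, 2)])

# Mutable objects such as lists are not hashable and cannot be keys.
try:
    bad = {[1, 2]: "list key"}
except TypeError as err:
    print("TypeError:", err)
```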

17.4.5 Defining Functions with def

Python functions are defined with the def statement.

The syntax for specifying function arguments and default values is very similar to R.

```{python}
#| eval: false
def my_function(name = "World"):
  print("Hello", name)
  
my_function()
my_function("Friend") 
```

The equivalent R snippet would be:

```{r}
my_function <- function(name = "World") {
  cat("Hello", name, "\n")
}

my_function()
my_function("Friend")
```
Hello World 
Hello Friend 
Important

A Key Difference: Unlike R functions, the last value in a function is not automatically returned. Python requires an explicit return statement.

You can define Python functions that take a variable number of arguments, similar to ... in R.

  • A notable difference is that R’s ... makes no distinction between named and unnamed arguments, but Python does.
  • In Python, prefixing with a single * captures unnamed arguments, and two ** signifies that keyword arguments are captured.
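The sketch below illustrates both points: a function without a return statement gives back None, and `*args` and `**kwargs` collect unnamed and named arguments separately. All function names here are made up for the example.

```python
def add_no_return(x, y):
    x + y                # computed, then discarded


def add(x, y):
    return x + y         # explicit return


print(add_no_return(1, 2))   # None
print(add(1, 2))             # 3


def show_args(*args, **kwargs):
    # args is a tuple of unnamed arguments; kwargs is a dict of named ones
    return args, kwargs


print(show_args(1, 2, sep="-", end="!"))
```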

17.4.6 Defining Classes with class

One could argue that in R, the preeminent unit of composition for code is the function, and in Python, it’s the class.

  • You can be a very productive R user and never explicitly use object-oriented constructs such as R6, reference classes, or similar R equivalents to the Python classes.

In Python however, understanding the basics of how class objects work is requisite knowledge, because classes are how you organize and find methods in Python.

Like the def statement, the class statement binds an object to a new callable symbol, MyClass.

Python uses a strong naming convention: classes are typically CamelCase, and functions are typically snake_case.

After defining MyClass, you can interact with it, and see that it has type ‘type’.

Calling MyClass() creates a new object instance of the class, which has type ‘MyClass’.

Functions defined inside a class code block are called methods.

  • Each time a method is called from a class instance, the instance is put into the function call as the first argument.
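A minimal sketch of what such a class might look like follows. The body of `MyClass`, including the `name` attribute and the `greet` method, is an assumption made for illustration.

```python
class MyClass:
    def __init__(self, name="World"):
        self.name = name            # store data on the instance

    def greet(self):                # self is the instance the method is called on
        return "Hello " + self.name


print(type(MyClass))                # <class 'type'>

instance = MyClass("Friend")
print(type(instance).__name__)      # MyClass

# These two calls are equivalent: the instance is passed as the first argument.
print(instance.greet())
print(MyClass.greet(instance))
```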

17.4.7 Integers and Floats

R users generally don’t need to be aware of the difference between integers and floating point numbers, but that’s not the case in Python.

  • In R, writing a bare literal number like 12 produces a floating point type, whereas in Python, it produces an integer.
  • You can produce an integer literal in R by appending an L, as in 12L.

Many Python functions expect integers, and will error when provided a float.
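As one example of the distinction, the built-in `range()` accepts only integers. The sketch below shows the two types and the error raised when a float is supplied.

```python
print(type(12))       # <class 'int'>
print(type(12.0))     # <class 'float'>
print(7 / 2)          # 3.5: true division always returns a float
print(7 // 2)         # 3: floor division of two ints returns an int

# range() is one built-in that accepts only integers.
try:
    list(range(3.0))
except TypeError as err:
    print("TypeError:", err)

print(list(range(int(3.0))))   # convert explicitly: [0, 1, 2]
```

When calling Python from R through {reticulate}, passing 3L rather than 3 is the corresponding fix, since an R integer is converted to a Python int and an R double to a Python float.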

17.4.8 Sourcing Scripts

The source_python() function will source a Python script and make the objects it creates available within an R environment (by default the calling environment).

For example, use the RStudio File/New File/Python Script command to open a script file.

  • Enter the following lines.
  • Save it as add.py in a py directory below your working directory.

```{python}
#| eval: false
def add(x, y):
  return x + y
```

Source it using the source_python() function (with a relative path).

Then you can run the python add() function directly from R or in an R code chunk:

```{r}
source_python('./py/add.py')
add(5, 10)
```
[1] 15

17.4.9 Object Conversion

By default, when Python objects are returned to R they are converted to their equivalent R types.

However, if you’d rather make conversion from Python to R explicit, and work with native Python objects by default, you can pass convert = FALSE to the import function.

In this case, Python to R conversion will be disabled for the module returned from import.

For example:

```{r}
# import numpy and specify no automatic Python to R conversion
np <- import("numpy", convert = FALSE)

# do some array manipulations with NumPy
a <- np$array(c(1:4))
sum <- a$cumsum()

# convert to R explicitly at the end
py_to_r(sum)
```
[1]  1  3  6 10
  • As illustrated above, if you need access to an R object at the end of your computations, you can call the py_to_r() function explicitly.

17.5 Numerical Python (NumPy)

| In R Use | In Python Use |
|---|---|
| Base R | numpy |
| dplyr/tidyr | pandas |
| ggplot2 | matplotlib/seaborn |

Create a Python script in RStudio or use python code chunks for the examples that follow.

17.5.1 NumPy Arrays

{NumPy} (short for Numerical Python) is the fundamental package or library for scientific computing in Python.

  • {NumPy} is a Python library for working with arrays of data that enables efficient storage and data operations as the arrays grow larger.

{NumPy} provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays.

  • These include: mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Let’s import the {numpy} package:

Tip

When importing a python package, we can create a nickname (alias) for it.

  • An alias is usually shorter than the package name to allow for less typing.

  • Use import my_package as my_nickname.

Many packages have “standard” nicknames and you should avoid using the nicknames for other objects even if not importing the package.

The standard nickname for {numpy} is np.

```{r}
#| message: false
library(reticulate)
```

```{python}
import numpy as np
```

In Python, assign variables with =, not <-.

```{python}
x = 10
x
```
10

The basic arithmetic operations (+, -, *, /) are the same. Exponentiation uses ** instead of ^, and the remainder operator is % instead of %%.

```{python}
#| eval: false
x * 2
x + 2
x / 2
x - 2
x ** 2 ## square
x % 2 ## remainder
```

Comments also begin with a #.

```{python}
#| eval: false
## This is a comment
```

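Help is called in a similar way. help() works anywhere, while the ? prefix is not standard Python syntax and works only in an interactive session such as the REPL started by reticulate::repl_python().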

```{python}
#| eval: false
help(min)
?min
```

Python lists are like R lists in that they can hold elements of different types.

You create Python lists with brackets [ ].

```{python}
x = ["hello", 1, True]
x
```
['hello', 1, True]

NumPy Arrays (class ndarray) are the Python equivalent to R atomic vectors (where each element must be the same type) but may have multiple dimensions.

  • You use the array() function of the {numpy} package to create a NumPy array (you give it a list as input).
  • When calling a function from a package, you have to specify the package name (or its alias) along with the function, using the syntax package.function().

```{python}
vec = np.array([2, 3, 5, 1])
vec
type(vec)
```
array([2, 3, 5, 1])
<class 'numpy.ndarray'>

You can do vectorized operations on NumPy arrays.

```{python}
#| eval: false
vec + 2
vec - 2
vec * 2
vec / 2
2 / vec
```

Two vectors of the same size (length) can be added/subtracted/multiplied/divided.

```{python}
#| eval: false
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
x + y
x - y
x / y
x * y
x ** y
```

You extract or subset individual elements of an array just like in R, using brackets [].

```{python}
vec
vec[0]
vec[0:2]
```
array([2, 3, 5, 1])
2
array([2, 3])
Important

Indexing of NumPy ndarrays differs from R in that the slicing syntax x:y selects from element x up to, but not including, element y. See Indexing on ndarrays.

You can extract arbitrary (non-contiguous) elements by subsetting with an index array.

```{python}
#| eval: false
ind = np.array([0, 2])
vec[ind]
## or
vec[np.array([0, 2])]
```

Key Difference: Python starts counting from 0, not 1.

  • So the first element of a vector is vec[0], not vec[1].
  • Negative indices count from the end of the array, so vec[-1] is the last element. This differs from R, where a negative index drops that element.
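A short sketch of zero-based indexing, negative indices, and slices, using the same values as the `vec` array above:

```python
import numpy as np

vec = np.array([2, 3, 5, 1])

print(vec[-1])     # 1: the last element (R's vec[-1] would drop the first)
print(vec[-2:])    # the last two elements
print(vec[1:3])    # elements at positions 1 and 2; position 3 is excluded
print(vec[:2])     # an omitted start means "from the beginning"
print(vec[::2])    # every second element
```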

Combine two arrays via np.concatenate(). Note the brackets inside the parentheses, ([ ]), which put the arrays in a list.

```{python}
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
np.concatenate([x, y])
```
array([1, 2, 3, 4, 5, 6, 7, 8])

17.5.2 Useful Functions for Operating on Vectors

In R, functions operate on objects (e.g. log(x), sort(x), etc.).

Python also has functions that operate on objects.

  • But, objects usually have functions that are directly associated with them.
  • These functions are called “methods”.

You access these functions by fully specifying the object and method with the syntax object.function(), i.e., a period between the object name and the function name.

  • You can use tab completion to scroll through the available methods of an object.
  • Recall that the object is placed as the first argument in the class method.

Start a REPL with reticulate::repl_python() (if not running) and enter ?vec.sort in the REPL console.

```{python}
#| eval: false
vec.sort() ## sort in place (returns None)
vec.min() ## minimum
vec.max() ## maximum
vec.mean() ## mean
vec.sum() ## sum
vec.var() ## variance
```

There are many other useful np.* functions that operate on objects.

```{python}
#| eval: false
np.sort(vec)
np.min(vec)
np.max(vec)
np.mean(vec)
np.sum(vec)
np.var(vec)
np.size(vec)
np.exp(vec)
np.log(vec)
```
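The method and function forms are not always interchangeable. The sketch below shows two differences worth knowing: the `sort()` method sorts in place and returns None while `np.sort()` returns a sorted copy, and NumPy's variance divides by n by default rather than by n - 1 as R's var() does.

```python
import numpy as np

vec = np.array([2, 3, 5, 1])

sorted_copy = np.sort(vec)   # the function returns a new sorted array
print(sorted_copy, vec)      # vec is unchanged

result = vec.sort()          # the method sorts vec in place
print(result, vec)           # result is None; vec is now sorted

# np.var() divides by n by default; ddof=1 divides by n - 1, as R's var() does.
print(np.var(vec))
print(np.var(vec, ddof=1))
```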

17.5.3 Booleans (Python’s logicals)

Python uses True and False. It uses the same comparison operators as R.

  • These are also vectorized.

```{python}
#| eval: false
vec > 3
vec < 3
vec == 3
vec != 3
vec <= 3
vec >= 3
```

The logical operators have a Key Difference: “Not” uses a different character, the tilde ~.

  • & And
  • | Or
  • ~ Not

```{python}
np.array([True, True, False, False]) & np.array([True, False, True, False])
np.array([True, True, False, False]) | np.array([True, False, True, False])
~ np.array([True, True, False, False])
```
array([ True, False, False, False])
array([ True,  True,  True, False])
array([False, False,  True,  True])

You subset a vector using Booleans inside the [ ] as you would in R.

```{python}
vec[vec <= 3]
```
array([2, 3, 1])

When you are dealing with single logicals, instead of arrays of logicals, use and, or, and not instead.

```{python}
True and False
True or False
not True
```
False
True
False
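To see why the two sets of operators are not interchangeable, the sketch below applies both to an array with the same values as `vec`. Using `and` on an array with more than one element raises an error, and `&` needs parentheses around each comparison.

```python
import numpy as np

vec = np.array([2, 3, 5, 1])

# & works element by element; each comparison needs its own parentheses
# because & binds more tightly than comparison operators such as >=.
print((vec >= 2) & (vec <= 3))

# `and` expects a single True/False, so it fails on a multi-element array.
try:
    (vec >= 2) and (vec <= 3)
except ValueError as err:
    print("ValueError:", err)

# Reduce an array of logicals to one value with all() or any().
print((vec >= 2).all(), (vec >= 2).any())
```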

17.5.3.1 Exercise:

  • Consider two vectors.

$$
\begin{aligned}
y &= (1, 7, 1, 2, 8, 2)\\
x &= (4, 6, 2, 7, 8, 2)
\end{aligned}
$$

Calculate their inner product.

$$
y_1x_1 + y_2x_2 + y_3x_3 + y_4x_4 + y_5x_5 + y_6x_6
$$

Do this using vectorized operations.

```{python}
#| eval: false
#| code-fold: true
y = np.array([1, 7, 1, 2, 8, 2])
x = np.array([4, 6, 2, 7, 8, 2])
np.sum(x * y)
```

17.5.3.2 Exercise

Provide two ways of extracting the 2nd and 5th elements of this vector.

```{python}
x = np.array([4, 7, 8, 1, 2])
```
```{python}
#| code-fold: true
#| eval: false
x[np.array([1, 4])]
x[np.array([False, True, False, False, True])]
x[np.array([-4, -1])]
```

17.5.3.3 Exercise:

Extract all elements from the previous vector between 5 and 8 (inclusive). Use predicates.

```{python}
#| code-fold: true
#| eval: false
## Note: Need parentheses here for multiple conditions
x[(x >= 5) & (x <= 8)]
```

17.5.4 Multi-Dimensional Arrays

NumPy can also support multi-dimensional arrays.

```{python}
np.random.seed(1)  ## seed for reproducibility

x1 = np.random.randint(10, size=6)  ## One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  ## Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  ## Three-dimensional array
```

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array).

```{python}
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
```
x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices.

```{python}
x2
x2[2, -1] ## third row (first dimension), last column (second dimension)
```
array([[1, 7, 6, 9],
       [2, 4, 5, 2],
       [4, 2, 4, 7]])
7

Values can also be modified using any of the above index notation.

```{python}
x2[0, 0] = 12
x2
```
array([[12,  7,  6,  9],
       [ 2,  4,  5,  2],
       [ 4,  2,  4,  7]])
Important

Key Difference: NumPy arrays have a fixed type.

  • This means if you attempt to insert a floating-point value into an integer array, the value will be silently truncated.

NumPy has capabilities to reshape and index multi-dimensional arrays.
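The sketch below, with made-up values, shows the silent truncation described above, one way to avoid it by converting the array first, and a simple reshape followed by indexing.

```python
import numpy as np

ints = np.array([1, 2, 3, 4, 5, 6])
ints[0] = 3.9                 # the array holds integers, so 3.9 is stored as 3
print(ints, ints.dtype)

floats = ints.astype(float)   # convert first if fractional values are needed
floats[0] = 3.9
print(floats)

grid = ints.reshape(2, 3)     # same six values viewed as 2 rows by 3 columns
print(grid)
print(grid[:, 1])             # all rows, second column
```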

17.5.5 Heterogeneous Data

NumPy uses “structured arrays” and “record arrays” to provide efficient storage for compound, heterogeneous data.

  • They could be considered conceptually similar to R list vectors in that all elements do not have to be of the same type.
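A minimal sketch of a structured array follows. The field names and values are made up for illustration.

```python
import numpy as np

# Each record has a string, an integer, and a float field.
people = np.array(
    [("Ann", 31, 55.0), ("Bob", 27, 85.5), ("Cy", 45, 68.0)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)

print(people["name"])                       # one field across all records
print(people[0])                            # one whole record
print(people[people["age"] > 30]["name"])   # names of records where age > 30
```

Each field keeps its own fixed type, so the array stays compact while still mixing text and numbers.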

17.6 Data Manipulation with Pandas

  • This section is based primarily on work by Professor David Gerard.

17.6.1 Intro to Pandas

{Pandas} is a newer package built on top of {NumPy}, and provides an efficient implementation of a data frame.

Pandas DataFrames are essentially multi-dimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

  • {Pandas} objects can be thought of as enhanced versions of {NumPy} structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

Import both {NumPy} and {Pandas} into the Python environment.

```{python}
import numpy as np
import pandas as pd
```

A Pandas Series is a one-dimensional array of indexed data.

  • It can be created from a list or array as follows and we can see the values using the object.values syntax.

```{python}
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
data.values
```
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
array([0.25, 0.5 , 0.75, 1.  ])

The essential difference from a NumPy array is that a Pandas Series has an explicitly defined index associated with the values instead of an implicit index.

  • Note, the index need not be an integer, but can consist of values of any desired type.
  • For example, if we wish, we can use strings as an index for the series.

```{python}
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
```
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

If a Series is an analog of a one-dimensional array with flexible indices, a “DataFrame” is an analog of a two-dimensional array with both flexible row indices and flexible column names.

  • Think of a DataFrame as a sequence of aligned Series objects.

  • Here, by “aligned” we mean they share the same index.
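The idea of a DataFrame as aligned Series can be sketched by building one from two Series that share an index. The column names and values below are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (made-up values).
area = pd.Series([1500, 2200, 1800], index=["a", "b", "c"])
price = pd.Series([200000, 340000, 250000], index=["a", "b", "c"])

# A dictionary of Series becomes a DataFrame: keys are the column names.
homes = pd.DataFrame({"Area": area, "Price": price})
print(homes)

print(homes.index.tolist())     # row labels shared by every column
print(homes.columns.tolist())   # column names
print(homes.loc["b", "Price"])  # look up a value by row and column label
```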

17.6.2 Pandas versus Tidyverse

Keep these equivalencies in mind:

  • <Series>.fun() means fun() is a method of the <Series> object.
  • <DataFrame>.fun() means fun() is a method of the <DataFrame> object.

| tidyverse | pandas |
|---|---|
| `arrange()` | `<DataFrame>.sort_values()` |
| `bind_rows()` | `pandas.concat()` |
| `filter()` | `<DataFrame>.query()` |
| `pivot_longer()` | `<DataFrame>.melt()` |
| `glimpse()` | `<DataFrame>.info()` and `<DataFrame>.head()` |
| `group_by()` | `<DataFrame>.groupby()` |
| `if_else()` | `numpy.where()` |
| `left_join()` | `pandas.merge()` |
| `library()` | `import` |
| `mutate()` | `<DataFrame>.eval()` and `<DataFrame>.assign()` |
| `read_csv()` | `pandas.read_csv()` |
| `recode()` | `<DataFrame>.replace()` |
| `rename()` | `<DataFrame>.rename()` |
| `select()` | `<DataFrame>.filter()` and `<DataFrame>.drop()` |
| `separate()` | `<Series>.str.split()` |
| `slice()` | `<DataFrame>.iloc[]` |
| `pivot_wider()` | `<DataFrame>.pivot_table().reset_index()` |
| `summarize()` | `<DataFrame>.agg()` |
| `unite()` | `<Series>.str.cat()` |
| `\|>` or `%>%` | Enclose the pipeline in `()` |
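As a sketch of the last row, the example below chains several of the methods from the table inside one pair of parentheses. The small `homes` data frame and the `price_per_sqft` column are made up for illustration.

```python
import pandas as pd

# A small made-up data frame standing in for real data.
homes = pd.DataFrame({
    "Price": [360000, 340000, 250000, 205500],
    "Area": [3032, 2058, 1780, 1638],
})

# Parentheses let the chain span several lines, one step per line,
# much like a |> pipeline in R.
result = (
    homes
    .query("Area < 2500")                                # filter()
    .assign(price_per_sqft=lambda d: d.Price / d.Area)   # mutate()
    .sort_values("price_per_sqft", ascending=False)      # arrange(desc())
    .filter(items=["Price", "price_per_sqft"])           # select()
)
print(result)
```

Each method returns a new DataFrame, which is what allows the next method to be called on the result.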

17.6.3 Importing Libraries

Python: import <package> as <alias>.

```{python}
import numpy as np
import pandas as pd
```

Use the alias (nickname) you define during the import in place of the package name.

  • R equivalent
```{r}
#| message: false
library(tidyverse)
```

17.6.4 Reading in and Printing Data

Pandas has a family of functions for reading in data from flat files such as CSV files.

Python: pd.read_csv().

```{python}
estate = pd.read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/estate.csv")
```

R equivalent:

```{r}
#| message: false
estate <- read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/estate.csv")
```

Use the info() and head() methods to see the data.

```{python}
estate.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522 entries, 0 to 521
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Price    522 non-null    int64 
 1   Area     522 non-null    int64 
 2   Bed      522 non-null    int64 
 3   Bath     522 non-null    int64 
 4   AC       522 non-null    int64 
 5   Garage   522 non-null    int64 
 6   Pool     522 non-null    int64 
 7   Year     522 non-null    int64 
 8   Quality  522 non-null    object
 9   Style    522 non-null    int64 
 10  Lot      522 non-null    int64 
 11  Highway  522 non-null    int64 
dtypes: int64(11), object(1)
memory usage: 49.1+ KB

```{python}
estate.head()
```
    Price  Area  Bed  Bath  AC  ...  Year  Quality  Style    Lot  Highway
0  360000  3032    4     4   1  ...  1972   Medium      1  22221        0
1  340000  2058    4     2   1  ...  1976   Medium      1  22912        0
2  250000  1780    4     3   1  ...  1980   Medium      1  21345        0
3  205500  1638    4     2   1  ...  1963   Medium      1  17342        0
4  275500  2196    4     3   1  ...  1968   Medium      7  21786        0

[5 rows x 12 columns]

R equivalent:

```{r}
head(estate)
glimpse(estate)
```
# A tibble: 6 × 12
   Price  Area   Bed  Bath    AC Garage  Pool  Year Quality Style   Lot Highway
   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl> <dbl>   <dbl>
1 360000  3032     4     4     1      2     0  1972 Medium      1 22221       0
2 340000  2058     4     2     1      2     0  1976 Medium      1 22912       0
3 250000  1780     4     3     1      2     0  1980 Medium      1 21345       0
4 205500  1638     4     2     1      2     0  1963 Medium      1 17342       0
5 275500  2196     4     3     1      2     0  1968 Medium      7 21786       0
6 248000  1966     4     3     1      5     1  1972 Medium      1 18902       0
Rows: 522
Columns: 12
$ Price   <dbl> 360000, 340000, 250000, 205500, 275500, 248000, 229900, 150000…
$ Area    <dbl> 3032, 2058, 1780, 1638, 2196, 1966, 2216, 1597, 1622, 1976, 28…
$ Bed     <dbl> 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 7, 3, 5, 5, 3, 5, 2, 3, 4, 3, 4,…
$ Bath    <dbl> 4, 2, 3, 2, 3, 3, 2, 1, 2, 3, 5, 4, 4, 4, 3, 5, 2, 4, 3, 3, 3,…
$ AC      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Garage  <dbl> 2, 2, 2, 2, 2, 5, 2, 1, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2,…
$ Pool    <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Year    <dbl> 1972, 1976, 1980, 1963, 1968, 1972, 1972, 1955, 1975, 1918, 19…
$ Quality <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "Medium", "M…
$ Style   <dbl> 1, 1, 1, 1, 7, 1, 7, 1, 1, 1, 7, 1, 7, 5, 1, 6, 1, 7, 7, 1, 2,…
$ Lot     <dbl> 22221, 22912, 21345, 17342, 21786, 18902, 18639, 22112, 14321,…
$ Highway <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

17.6.5 DataFrames and Series

Pandas reads in tabular data as a DataFrame object.

  • Just as R’s data.frame is a list of atomic vectors with the same length, a pandas DataFrame is a collection of Series objects with the same length.
  • A Series is built on top of a numpy array and adds an index of labels, so you can use {numpy} functions on it.

```{python}
#| eval: false
x = pd.Series([1, 4, 2, 1])
x[2:3]
x[pd.Series([0, 2])]
x[x >= 2]
np.sum(x)
```
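
The following is a minimal, self-contained sketch of the Series operations above. It uses the explicit `.iloc` (position) and `.loc` (label) indexers to make clear which kind of selection each line performs; that choice is an addition for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# A small Series with the default integer index 0, 1, 2, 3
x = pd.Series([1, 4, 2, 1])

by_slice = x.iloc[2:3]    # positional slice: only the element at position 2
by_label = x.loc[[0, 2]]  # label-based selection of index labels 0 and 2
by_mask = x[x >= 2]       # boolean mask keeps the values that are >= 2
total = np.sum(x)         # NumPy functions accept a Series

print(by_slice.tolist(), by_label.tolist(), by_mask.tolist(), total)
```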

17.6.6 Extract Variables

Python: Use a period to extract a variable, e.g., dataframe.variable.

This extracts the column as a {pandas} Series.

```{python}
#| eval: false
estate.Price
```

Then you can use all of those {numpy} functions on the series.

```{python}
#| eval: false
np.mean(estate.Price)
np.max(estate.Price)
```

R equivalent: Use a $.

```{r}
#| eval: false
estate$Price
```

17.6.7 Filtering/Arranging Rows (Observations)

Filter rows based on booleans (logicals) with query().

  • The queries need to be in quotes.

```{python}
#| eval: false
estate.query('(Price > 300000) & (Area < 2500)')
```
  • Some people use bracket notation with a boolean mask, which is more similar to base R.

```{python}
#| eval: false
estate[(estate.Price > 300000) & (estate.Area < 2500)]
```

R equivalent:

```{r}
#| eval: false
filter(estate, Price > 300000, Area < 2500)
```
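
As a rough check that `query()` and bracket notation select the same rows, here is a small sketch on a hypothetical miniature data frame. The name `homes` and its values are invented for illustration and are not the estate data.

```python
import pandas as pd

# Hypothetical miniature version of the estate data (values invented)
homes = pd.DataFrame({
    "Price": [360000, 340000, 250000, 310000],
    "Area": [3032, 2058, 1780, 2400],
})

via_query = homes.query("(Price > 300000) & (Area < 2500)")
via_mask = homes[(homes.Price > 300000) & (homes.Area < 2500)]

# Both approaches keep the same rows and preserve the original index labels
print(via_query)
print(via_query.equals(via_mask))
```

Note that the surviving rows keep their original index labels rather than being renumbered.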

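Select rows by integer position with iloc[]. Note that iloc is an indexer used with square brackets, not a function call, and that positions are zero-based.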

```{python}
estate.iloc[[1, 4, 10]]
```
     Price  Area  Bed  Bath  AC  ...  Year  Quality  Style    Lot  Highway
1   340000  2058    4     2   1  ...  1976   Medium      1  22912        0
4   275500  2196    4     3   1  ...  1968   Medium      7  21786        0
10  190000  2812    7     5   0  ...  1966      Low      7  56639        0

[3 rows x 12 columns]

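R equivalent (R indices are one-based, so slice() with the same numbers returns different rows than the zero-based iloc[] call):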

```{r}
slice(estate, 1, 4, 10)
```
# A tibble: 3 × 12
   Price  Area   Bed  Bath    AC Garage  Pool  Year Quality Style   Lot Highway
   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl> <dbl>   <dbl>
1 360000  3032     4     4     1      2     0  1972 Medium      1 22221       0
2 205500  1638     4     2     1      2     0  1963 Medium      1 17342       0
3 160000  1976     3     3     0      1     0  1918 Low         1 32358       0
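
The two outputs above differ because Python counts positions from 0 and R counts from 1. The sketch below, on a hypothetical five-row data frame, shows one way to translate R-style indices into `iloc[]` positions; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: the default row labels match the zero-based positions
df = pd.DataFrame({"Price": [360000, 340000, 250000, 205500, 275500]})

picked = df.iloc[[1, 4]]  # zero-based: the second and fifth rows
print(picked)

# To mimic one-based R indices such as c(1, 4), subtract 1 from each index
r_style = [1, 4]
same_as_r = df.iloc[[i - 1 for i in r_style]]
print(same_as_r)
```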

Arrange rows by sort_values().

```{python}
#| eval: false
estate.sort_values(by="Price", ascending=False)
```

R equivalent

```{r}
#| eval: false
arrange(estate, desc(Price))
```

17.6.7.1 Exercise:

Use {pandas} and then {tidyverse} to extract all medium quality homes that have a pool and arrange the rows in increasing order of price.

```{python}
#| eval: false
#| code-fold: true
temp = estate.query('(Quality == "Medium") & (Pool > 0)')

temp.sort_values(by="Price")

estate.query('(Quality == "Medium") & (Pool > 0)').sort_values(by="Price").head()
```

Note the use of the period . to “pipe” to subsequent methods.

```{r}
#| eval: false
#| code-fold: true
estate  |> 
  filter(Quality == "Medium", Pool > 0) |>
  arrange(Price)
```

17.6.8 Selecting Columns (Variables)

Columns (variables) are selected with the DataFrame’s filter() method. Be careful: in R’s {dplyr}, filter() subsets rows, not columns.

```{python}
#| eval: false
estate.filter(["Price"])
estate.filter(["Price", "Area"])
```

Some people use bracket notation, which is more similar to Base R.

```{python}
#| eval: false
estate[["Price"]]
estate["Price"]  # Note the difference: a single string returns a Series, not a DataFrame
estate[["Price", "Area"]]
```
  • The inner brackets [ ] create a Python list which is then used by the outer brackets [ ] for sub-setting the columns.

R equivalent:

```{r}
#| eval: false
estate["Price"]
estate[c("Price","Area")]
```
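
To make the single-bracket versus double-bracket distinction concrete, here is a small sketch that inspects the types returned. The `homes` data frame and its values are made up for illustration.

```python
import pandas as pd

homes = pd.DataFrame({"Price": [360000, 340000], "Area": [3032, 2058]})  # made-up values

single = homes["Price"]    # a string selects one column as a Series
double = homes[["Price"]]  # a list of names returns a DataFrame with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```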

Dropping a column is done by drop().

  • The axis = 1 argument says to drop by columns (rather than by “index”, which is something we haven’t covered).

```{python}
#| eval: false
estate.drop(["Price", "Area"], axis = 1)
```

R: Use select() with a minus sign.

```{r}
#| eval: false
select(estate, -Price, -Area)
```

Renaming variables is done with rename().

```{python}
#| eval: false
estate.rename({'Price': 'price', 'Area': 'area'}, axis = 'columns')
```

R equivalent.

```{r}
rename(estate, price = Price, area = Area)
```
# A tibble: 522 × 12
    price  area   Bed  Bath    AC Garage  Pool  Year Quality Style   Lot Highway
    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <chr>   <dbl> <dbl>   <dbl>
 1 360000  3032     4     4     1      2     0  1972 Medium      1 22221       0
 2 340000  2058     4     2     1      2     0  1976 Medium      1 22912       0
 3 250000  1780     4     3     1      2     0  1980 Medium      1 21345       0
 4 205500  1638     4     2     1      2     0  1963 Medium      1 17342       0
 5 275500  2196     4     3     1      2     0  1968 Medium      7 21786       0
 6 248000  1966     4     3     1      5     1  1972 Medium      1 18902       0
 7 229900  2216     3     2     1      2     0  1972 Medium      7 18639       0
 8 150000  1597     2     1     1      1     0  1955 Medium      1 22112       0
 9 195000  1622     3     2     1      2     0  1975 Low         1 14321       0
10 160000  1976     3     3     0      1     0  1918 Low         1 32358       0
# ℹ 512 more rows

17.6.8.1 Exercise:

  • Use {pandas} and then {tidyverse} to select Year, Price, and Area.
```{python}
#| eval: false
#| code-fold: true
estate.filter(["Year", "Price", "Area"])
```
```{r}
#| eval: false
#| code-fold: true
estate |>
  select(Year, Price, Area)
estate[c("Year", "Price", "Area")]
```

17.6.9 Creating New Variables (Mutate)

New variables are created in Python using eval().

  • Place the entire expression in quotes (not tick marks).

```{python}
#| eval: false
estate.eval('age = 2013 - Year')
estate.eval("age = 2013 - Year")
# Backticks do not delimit strings in Python, so this form fails: estate.eval(`age = 2013 - Year`)
```

You can also use assign(), but then you need to reference the DataFrame as you extract variables.

```{python}
#| eval: false
estate.assign(age = 2013 - estate.Year)
```

R equivalent:

```{r}
#| eval: false
mutate(estate, age = 2013 - Year)
```
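
One detail worth seeing in code: like mutate(), both `eval()` (with its default settings) and `assign()` return a new DataFrame and leave the original unchanged, so the result must be saved to a name to keep it. A minimal sketch with made-up values:

```python
import pandas as pd

homes = pd.DataFrame({"Year": [1972, 1976, 1980]})  # made-up values

# Neither call modifies `homes`; each returns a new DataFrame
with_eval = homes.eval("age = 2013 - Year")
with_assign = homes.assign(age=2013 - homes.Year)

print(list(homes.columns))      # the original still has only Year
print(with_eval.age.tolist())
print(with_assign.age.tolist())
```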

17.6.9.1 Exercise:

Use {pandas} and then {tidyverse} to create a new variable, ppa with the calculated price per unit area.

```{python}
#| eval: false
#| code-fold: true
estate.eval('ppa = Price / Area')
```
```{r}
#| eval: false
#| code-fold: true
mutate(estate, ppa = Price / Area)
```

17.6.10 Piping

Each of these pandas methods returns a DataFrame, so we can chain further methods by appending them to the end.

Suppose we want to find the total number of beds and baths, and select the price and this total number.

Try the following code.

```{python}
#| eval: false
estate.eval('tot = Bed + Bath') ## first part as an example
estate.eval('tot = Bed + Bath').filter(["Price", "tot"])
```
  • If you want to place these operations on different lines, then just place the whole operation within parentheses similar to using {} in R for an expression.

```{python}
#| eval: false
( #start parenthesis
estate.eval('tot = Bed + Bath')
  .filter(["Price", "tot"])
) #end parenthesis
```

This looks similar to piping in R.

```{r}
#| eval: false
estate  |> 
  mutate(tot = Bed + Bath)  |> 
  select(Price, tot)
```

17.6.10.1 Exercise:

Use {pandas} with piping to extract all medium quality homes that have a pool and arrange the rows in increasing order of price.

```{python}
#| eval: false
#| code-fold: true
(
estate.query('(Quality == "Medium") & (Pool > 0)')
  .sort_values(by="Price")
)
```

17.6.11 Summaries and Grouped Summaries

Summaries can be calculated using the DataFrame’s agg() method.

  • You usually first select the columns whose summaries you want before running agg().

```{python}
(
estate.filter(["Price", "Area"])
  .agg(np.mean)
)
```
Price    277894.147510
Area       2260.626437
dtype: float64

R equivalent

```{r}
summarize(estate, Price = mean(Price), Area = mean(Area))
```
# A tibble: 1 × 2
    Price  Area
    <dbl> <dbl>
1 277894. 2261.

Use the DataFrame’s groupby() method to create group summaries.

```{python}
#| eval: false
(
estate.filter(["Price", "Area", "Bed", "Bath"])
  .groupby(["Bed", "Bath"])
  .agg(np.mean)
)
```

R equivalent

```{r}
#| eval: false
estate |>
  group_by(Bed, Bath) |>
  summarize(Price = mean(Price), Area = mean(Area))
```

You can get multiple summaries by passing a list of functions (created using [ ]).

```{python}
#| eval: false
(
estate.filter(["Price", "Area", "Quality"])
  .groupby("Quality")
  .agg([np.mean, np.var])
)
```

You can create your own functions (with def) and pass those.

```{python}
def cv(x):
  """Calculate coefficient of variation"""
  return(np.sqrt(np.var(x)) / np.mean(x))
```
```{python}
(
estate.filter(["Price", "Area"])
  .agg(cv)
)
```
Price    0.495841
Area     0.314242
dtype: float64
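
As an alternative to passing NumPy functions, `agg()` also accepts the names of built-in summaries as strings, and grouped summaries can name their output columns directly, which is closer in spirit to summarize(). The sketch below assumes a tiny invented data frame; the output column names `mean_price` and `max_price` are illustrative.

```python
import pandas as pd

homes = pd.DataFrame({
    "Quality": ["Low", "Low", "Medium", "Medium"],
    "Price": [100000.0, 200000.0, 300000.0, 500000.0],
})  # made-up values

# A string name selects the built-in pandas implementation
overall = homes[["Price"]].agg("mean")

# Named aggregation: new_column=(source_column, function)
by_quality = homes.groupby("Quality").agg(
    mean_price=("Price", "mean"),
    max_price=("Price", "max"),
)
print(overall)
print(by_quality)
```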

17.6.12 Recoding Variable Values

Use replace() with a dict object to recode variable values.

  • Useful with Categorical variables, the equivalent to R Factors.

```{python}
#| eval: false
estate.replace({'AC' : {0: "No AC", 1: "AC"}})
```

R equivalent:

```{r}
#| eval: false
estate |>
  mutate(AC = recode(AC, "0" = "No AC", "1" = "AC"))
```

To recode values based on logical conditions, use np.where().

```{python}
#| eval: false
estate.assign(isbig = np.where(estate.Price > 300000, "expensive", "cheap"))
```

R equivalent.

```{r}
#| eval: false
mutate(estate, isbig = if_else(Price > 300000, "expensive", "cheap"))
```
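
The two recoding approaches can be combined in one chain. This is a sketch on invented values; the column name `price_band` is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

homes = pd.DataFrame({"Price": [360000, 250000], "AC": [1, 0]})  # made-up values

recoded = (
    # dict of {column: {old_value: new_value}} recodes only the AC column
    homes.replace({"AC": {0: "No AC", 1: "AC"}})
    # np.where(condition, value_if_true, value_if_false) works elementwise
    .assign(price_band=np.where(homes.Price > 300000, "expensive", "cheap"))
)
print(recoded)
```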

17.6.13 Reshape DataFrames Longer with Melt

Problem: One variable spread across multiple columns.

  • Column names are actually values of a variable.
  • Recall table4a from the {tidyr} package.
```{r}
data("table4a")
```

```{python}
table4a = r.table4a
table4a
```
       country      1999      2000
0  Afghanistan     745.0    2666.0
1       Brazil   37737.0   80488.0
2        China  212258.0  213766.0

Solution: melt() (similar to {data.table}).

```{python}
table4a.melt(id_vars='country', value_vars=['1999', '2000'])
```
       country variable     value
0  Afghanistan     1999     745.0
1       Brazil     1999   37737.0
2        China     1999  212258.0
3  Afghanistan     2000    2666.0
4       Brazil     2000   80488.0
5        China     2000  213766.0

R equivalent.

```{r}
pivot_longer(table4a, cols = c("1999", "2000"), 
 names_to = "variable",
 values_to = "value")
```
# A tibble: 6 × 3
  country     variable  value
  <chr>       <chr>     <dbl>
1 Afghanistan 1999        745
2 Afghanistan 2000       2666
3 Brazil      1999      37737
4 Brazil      2000      80488
5 China       1999     212258
6 China       2000     213766
  • R4DS visualization:

Pivot Longer
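
`melt()` also accepts `var_name` and `value_name` arguments, which play the role of `names_to` and `values_to` in pivot_longer(). The sketch below rebuilds the small table shown above directly in pandas so it runs without R; the output column names `year` and `cases` are illustrative choices.

```python
import pandas as pd

# The table4a values shown above, entered by hand
table4a = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

long = table4a.melt(
    id_vars="country",
    value_vars=["1999", "2000"],
    var_name="year",     # name for the column holding the old column names
    value_name="cases",  # name for the column holding the cell values
)
print(long)
```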

17.6.13.1 Exercise:

Use {pandas} to reshape the monkeymem data frame (available at https://raw.githubusercontent.com/AU-datascience/data/main/413-613/monkeymem.csv).

  • The cell values represent identification accuracy of some objects (in percent of 20 trials).
```{python}
#| eval: false
#| code-fold: true
monkey = pd.read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/monkeymem.csv")
monkeyclean = monkey.melt(id_vars=['Monkey', 'Treatment'])
```

17.6.14 Reshape DataFrames Wider with pivot_table()

Problem: One observation is spread across multiple rows.

  • One column contains variable names. One column contains values for the different variables.
  • Recall table2 from the {tidyr} package.
  • Load and assign to a python variable.
```{r}
data("table2")
```

```{python}
table2 = r.table2
```

Solution: pivot_table() followed by reset_index().

```{python}
(
table2.pivot_table(index=['country', 'year'], columns='type', 
values='count')
  .reset_index()
)
```
type      country    year     cases    population
0     Afghanistan  1999.0     745.0  1.998707e+07
1     Afghanistan  2000.0    2666.0  2.059536e+07
2          Brazil  1999.0   37737.0  1.720064e+08
3          Brazil  2000.0   80488.0  1.745049e+08
4           China  1999.0  212258.0  1.272915e+09
5           China  2000.0  213766.0  1.280429e+09
  • pivot_table() creates a table with an index attribute defined by the columns you pass to the index argument.
  • The reset_index() converts that attribute to columns and changes the index attribute to a sequence [0, 1, ..., n-1].
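
The word `type` in the top-left corner of the output above is the name of the column axis left over from the `columns` argument. A minimal sketch, using a hand-entered subset of the table2 values, shows one way to drop it. Note also that `pivot_table()` aggregates with the mean by default, which is why the counts come back as floats.

```python
import pandas as pd

# A hand-entered subset of table2 (1999 rows for two countries)
long = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil"],
    "year": [1999, 1999, 1999, 1999],
    "type": ["cases", "population", "cases", "population"],
    "count": [745, 19987071, 37737, 172006362],
})

wide = (
    long.pivot_table(index=["country", "year"], columns="type", values="count")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "type" label on the column axis
print(wide)
```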

R equivalent.

```{r}
pivot_wider(table2, id_cols = c("country", "year"), 
    names_from = "type", values_from = "count")
```
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
  • R4DS visualization:

17.6.14.1 Exercise:

Use {pandas} to read and reshape the flowers1 dataframe (available at https://raw.githubusercontent.com/AU-datascience/data/main/413-613/flowers1.csv).

```{python}
#| eval: false
#| code-fold: true
flowers = pd.read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/flowers1.csv", sep=";", decimal=",")
```
```{python}
#| eval: false
#| code-fold: true
(
flowers.pivot_table(index=['Time', 'replication'], 
                columns='Variable', 
                values='Value')
      .reset_index()
      )
```

17.6.15 Separating a Variable into Two or more Columns

Sometimes we want to split a column based on a delimiter.

```{r}
data("table3")
```

```{python}
table3 = r.table3
table3
```
       country    year               rate
0  Afghanistan  1999.0       745/19987071
1  Afghanistan  2000.0      2666/20595360
2       Brazil  1999.0    37737/172006362
3       Brazil  2000.0    80488/174504898
4        China  1999.0  212258/1272915272
5        China  2000.0  213766/1280428583

Use object.column.str.split(pat = "", expand = True), supplying the delimiter as pat; expand = True returns the pieces as separate columns.

```{python}
table3[['cases', 'population']] = table3.rate.str.split(pat = '/',
expand = True)
table3.drop('rate', axis=1) ## remove the rate column since axis = 1
```
       country    year   cases  population
0  Afghanistan  1999.0     745    19987071
1  Afghanistan  2000.0    2666    20595360
2       Brazil  1999.0   37737   172006362
3       Brazil  2000.0   80488   174504898
4        China  1999.0  212258  1272915272
5        China  2000.0  213766  1280428583

R equivalent.

```{r}
separate_wider_delim(table3, col = "rate", delim = "/", names = c("cases", "population"))
```
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
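
In both languages the split pieces are still character strings (note the `<chr>` columns in the R output). A sketch of one way to convert them to integers in pandas, using a hand-entered subset of table3 and the `astype()` method:

```python
import pandas as pd

# A hand-entered subset of table3
table3 = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "rate": ["745/19987071", "37737/172006362"],
})

parts = table3.rate.str.split(pat="/", expand=True)  # columns are labeled 0 and 1
tidy = (
    table3.assign(
        cases=parts[0].astype("int64"),
        population=parts[1].astype("int64"),
    )
    .drop("rate", axis=1)
)
print(tidy.dtypes)
```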

17.6.15.1 Exercise:

```{python}
#| eval: false
#| code-fold: true
flowers2 = pd.read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/flowers2.csv", sep=";")

flowers2[['Flowers', 'Intensity']] = flowers2['Flowers/Intensity'].str.split(pat = "/", expand = True)

flowers2 = flowers2.drop('Flowers/Intensity', axis = 1)
```

17.6.16 Uniting Variables

Sometimes we want to combine two columns of character/string variables into one column.

```{r}
data("table5")
```

```{python}
table5 = r.table5
table5
```
       country century year               rate
0  Afghanistan      19   99       745/19987071
1  Afghanistan      20   00      2666/20595360
2       Brazil      19   99    37737/172006362
3       Brazil      20   00    80488/174504898
4        China      19   99  212258/1272915272
5        China      20   00  213766/1280428583

Use str.cat() to combine two columns century and year.

```{python}
(
table5.assign(year = table5.century.str.cat(table5.year))
  .drop('century', axis = 1)
)
```
       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583

R equivalent.

```{r}
unite(table5, century, year, col = "year", sep = "")
```
# A tibble: 6 × 3
  country     year  rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

17.6.16.1 Exercise:

  • Use {pandas} to re-unite the data frame you separated from the flowers2 exercise.
  • Use a comma for the separator.
```{python}
#| code-fold: true
#| eval: false
flowers2.assign(ratio = flowers2.Flowers.str.cat(flowers2.Intensity, sep = ","))
```

17.6.17 Combining and Joining Two DataFrames

We will use these DataFrames for the examples below.

```{python}
#| eval: false
xdf = pd.DataFrame({"mykey": np.array([1, 2, 3]), 
                    "x": np.array(["x1", "x2", "x3"])})
ydf = pd.DataFrame({"mykey": np.array([1, 2, 4]), 
                    "y": np.array(["y1", "y2", "y3"])})
xdf
ydf
```
```{r}
#| eval: false
#| layout-ncol: 2
xdf <- tibble(mykey = c("1", "2", "3"),
                  x_val = c("x1", "x2", "x3"))
ydf <- tibble(mykey = c("1", "2", "4"),
              y_val = c("y1", "y2", "y3"))
xdf
ydf
```

Binding rows is done with pd.concat().

```{python}
#| eval: false
pd.concat([xdf, ydf])
```

R equivalent.

```{r}
#| eval: false
bind_rows(xdf, ydf)
```
Important

All joins use pd.merge().

Inner Join (visualization from R4DS).

Inner Join.

```{python}
#| eval: false
pd.merge(left=xdf, right=ydf, how="inner", on="mykey")
```
```{r}
#| eval: false
inner_join(xdf, ydf, by = "mykey")
```

Outer Joins (visualization from R4DS):

Outer Joins

Left Join

```{python}
#| eval: false
pd.merge(left=xdf, right=ydf, how="left", on="mykey")
```
```{r}
#| eval: false
left_join(xdf, ydf, by = "mykey")
```

Right Join

```{python}
#| eval: false
pd.merge(left=xdf, right=ydf, how="right", on="mykey")
```
```{r}
#| eval: false
right_join(xdf, ydf, by = "mykey")
```

Full Join (try not to use very often …)

```{python}
#| eval: false
pd.merge(left=xdf, right=ydf, how="outer", on="mykey")
```
```{r}
#| eval: false
full_join(xdf, ydf, by = "mykey")
```
  • Use the left_on and right_on arguments if the keys are named differently.
  • The on argument can take a list of key names if your key is multiple columns.
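
The sketch below illustrates the `left_on` and `right_on` arguments on a variant of the example data in which the second key is renamed to `ykey` (a hypothetical name chosen for illustration). It also uses the `indicator` argument of `pd.merge()`, which records where each row of a full join came from.

```python
import pandas as pd

xdf = pd.DataFrame({"mykey": [1, 2, 3], "x": ["x1", "x2", "x3"]})
ydf = pd.DataFrame({"ykey": [1, 2, 4], "y": ["y1", "y2", "y3"]})  # key renamed for illustration

full = pd.merge(
    left=xdf, right=ydf, how="outer",
    left_on="mykey", right_on="ykey",
    indicator=True,  # adds a _merge column: left_only, right_only, or both
)
print(full)
```

Because the keys have different names, both key columns are kept in the result, with missing values where a row had no match.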

17.7 Python Matplotlib and Seaborn Libraries

  • This section is based primarily on work by Professor David Gerard.

{Matplotlib} is a comprehensive library for creating static, animated, and interactive visualizations in Python.

  • {Matplotlib} graphs your data onto Figures (e.g., windows, Jupyter widgets, etc.), each of which can contain one or more axes.

{Seaborn} is a library for making statistical graphics in Python.

  • It builds on top of {matplotlib} and integrates closely with {pandas} Series and DataFrames.
  • It lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.

Seaborn has different modules of functions for different analytic purposes, e.g., “relational”, “distributional”, and “categorical” modules.

Seaborn Function Modules. Michael Waskom (2023)

Seaborn has extensive documentation, a tutorial as well as a gallery of examples.

Let’s load {reticulate} and {ggplot2} in R and load the mpg data set.

```{r}
library(reticulate)
library(ggplot2)
data("mpg")
```
Note

All other code in this section will be Python unless otherwise marked.

Import matplotlib.pyplot and {seaborn} in python and create a python dataframe from mpg.

```{python}
#| label: import-python-packages
import matplotlib.pyplot as plt
import seaborn as sns
mpg = r.mpg #create a python object from the R data frame
import warnings  # suppress FutureWarning messages triggered by a known seaborn issue
# see https://github.com/mwaskom/seaborn/issues/3486
warnings.simplefilter(action="ignore", category=FutureWarning)
```

17.7.1 Create, Show, and Clear plots.

Use a plotting function to create a plot object.

  • Matplotlib will return information about the plot unless you assign a name to it.
  • Use .set(title='My Title') to add a title.

Use plt.show() to display a plot that has been created.

Use plt.close() to close the plot and free memory.

  • Use plt.clf() to clear a plot before making a new plot on the same layout.
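
The difference between clearing and closing a figure can be checked with plain {matplotlib}. This is a sketch under two assumptions: it uses the non-interactive `Agg` backend so that it runs without a display, and it draws with matplotlib directly rather than seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: a non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3])  # made-up data
ax.set(title="My Title")

open_before = len(plt.get_fignums())  # the figure is registered with pyplot
plt.clf()                             # clears the current figure but keeps it open
axes_after_clf = len(fig.axes)
plt.close(fig)                        # closes the figure and releases it
open_after = len(plt.get_fignums())

print(open_before, axes_after_clf, open_after)
```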

17.7.2 One Quantitative Variable: Histogram

sns.histplot() makes a histogram.

```{python}
#| message: false
plt.close()
my_plot = sns.histplot(x = 'hwy', data = mpg).set(title='sns.histplot')
plt.show()
```

17.7.3 One Categorical Variable: Barplot

Use sns.countplot() to make a barplot of the distribution of a categorical variable.

```{python}
plt.close()
my_plot = sns.countplot(x = 'class', data = mpg).set(title='sns.countplot')
plt.show()
```

17.7.4 One Quantitative Variable, One Categorical Variable: Boxplot

Use sns.boxplot() to make boxplots.

```{python}
plt.close()
my_plot = sns.boxplot(x = 'class', y = 'hwy', data = mpg).set(title='sns.boxplot')
plt.show()
```

A boxenplot (also called a letter-value plot) shows more quantiles than a standard boxplot.

```{python}
plt.close()
my_plot = sns.boxenplot(x='class', y='hwy', data=mpg).set(title='sns.boxenplot')
plt.show()
```

17.7.5 Two Quantitative Variables: Scatterplot

Use sns.scatterplot() to make a basic scatter plot.

```{python}
plt.close()
my_plot = sns.scatterplot(x='displ', y='hwy', data=mpg).set(title='sns.scatterplot')
plt.show()
```

17.7.6 Lines/Smoothers

Use sns.regplot() to make a scatter plot with a regression line or a loess smoother.

  • Regression line with 95% Confidence interval.
```{python}
plt.close()
my_plot = sns.regplot(x = 'displ', y = 'hwy', data = mpg)
plt.show()
```

  • Loess smoother with confidence interval removed.
```{python}
plt.close()
my_plot = sns.regplot(x = 'displ', y = 'hwy', data = mpg, lowess = True, ci = None)
plt.show()
```

17.7.7 Annotating by a Third Variable

Use the hue or style arguments to annotate by a categorical variable:

```{python}
plt.close()
my_plot = sns.scatterplot(x='displ', y='hwy', hue='class', data=mpg)
plt.show()
```

```{python}
plt.close()
my_plot = sns.scatterplot(x='displ', y='hwy', style='class', data=mpg)
plt.show()
```

Use the hue or size arguments to annotate by a quantitative variable.

```{python}
plt.close()
my_plot = sns.scatterplot(x='cty', y='hwy', hue='displ', data=mpg)
plt.show()
```

```{python}
plt.close()
my_plot = sns.scatterplot(x='cty', y='hwy', size='displ', data=mpg)
plt.show()
```

17.7.8 Two Categorical Variables: Mosaic Plot

When you want to plot two categorical variables, try a mosaic plot from the {statsmodels} package.

Import the plot object (and methods) from statsmodels.graphics.mosaicplot.

```{python}
#| label: mosaic-plot-1
#| results: hide
from statsmodels.graphics.mosaicplot import mosaic
```
```{python}
#| fig.height: 8.0
#| fig.width: 8.0
#| fig.keep: all
plt.close()
my_plot = mosaic(data = mpg, index=['class', 'drv'])
plt.show()
```

17.7.9 Facets

Use sns.FacetGrid() followed by the map() method to create faceted plots.

  • The FacetGrid() constructor creates the faceted plot structure and the map() method says how to assign the plot types and variables to the facets.
```{python}
#| label: Faceted Plot
plt.close()

my_plot = sns.FacetGrid(data = mpg, row = 'drv').map(sns.histplot,
'hwy', kde = False)

plt.show()
```

A density plot version.

```{python}
plt.clf()

my_plot = sns.FacetGrid(data = mpg, row = 'drv').map(sns.histplot, 'hwy',
kde = True)  # kde = True overlays a kernel density estimate

plt.show()
```

17.7.10 Labels

Use plt.close() to close the figure window and start fresh.

You can add labels by assigning a plot to an object and then using set_*() methods to add labels.

```{python}
#| fig.height: 3.0
#| fig.width: 6.0
plt.close()
scatter = sns.scatterplot(x='displ', y='hwy', data=mpg)
scatter.set_xlabel('Displacement')
scatter.set_ylabel('Highway')
scatter.set_title('Highway versus Displacement')
plt.show()
```

17.7.11 Saving Figures

  1. First, assign a figure to an object.
```{python}
#| eval: false
scatter = sns.scatterplot(x = 'displ', y = 'hwy', data = mpg)
```
  2. Extract the figure. Assign this to an object.
```{python}
#| eval: false
fig = scatter.get_figure()
```
  3. Save the figure.
```{python}
#| label: save-figure-3
#| eval: false
fig.savefig('./output/scatter.pdf')
```

You can do all of these steps using piping.

  • Note the use of \ to break the line instead of enclosing in ().
```{python}
#| eval: false
sns.scatterplot(x='displ', y='hwy', data=mpg) \
  .get_figure() \
  .savefig('./output/scatter.pdf')
```
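
One practical caveat: `savefig()` writes into an existing directory and raises an error if the directory is missing. The sketch below creates the directory first. It uses plain {matplotlib} with the `Agg` backend and a temporary directory as a stand-in for `./output`; all paths and data values here are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: a non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

out_dir = Path(tempfile.mkdtemp()) / "output"  # hypothetical stand-in for ./output
out_dir.mkdir(parents=True, exist_ok=True)     # create the directory before saving

fig, ax = plt.subplots()
ax.scatter([1.8, 2.0, 2.8], [29, 31, 26])      # made-up points
out_file = out_dir / "scatter.pdf"
fig.savefig(out_file)                          # the file extension selects the format
plt.close(fig)

print(out_file.exists())
```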

17.8 Examples of R and Python

17.8.1 Running the Python Linear Model in R

```{r}
#| message: false
library(dplyr)
library(reticulate)
```

The {scikit-learn} library is on PyPI.

  • Install using the terminal window by entering pip install scikit-learn.

Create a Python script with the following code and save it in a py directory; the source_python() call below expects the file name e_example1.py.

   from sklearn import linear_model as py_lm
   linreg_python = py_lm.LinearRegression()

Source the script in R with source_python() and a relative path.

```{r}
source_python("./py/e_example1.py")
```

Glimpse mtcars in R.

```{r}
glimpse(mtcars)
```
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Check out the python linear regression function using View().

  • Note: A function or method of a Python object is accessed with the $ operator after the R object that holds it.
  • This is very similar to how a column of a data frame is accessed using the $ operator.
```{r}
#| eval: false
View(linreg_python)
py_help(linreg_python$fit)
```

Run the model in R using the python method fit for (mpg ~ everything_else).

  • Separate the response y and the explanatory variables x.
  • Note, linreg_python$fit uses X for the explanatory variables.
```{r}
x <- mtcars[,-1]
y <- mtcars$mpg
py_lmout <- linreg_python$fit(X = x, y = y)
```

Look at the output.

```{r}
#| label: show-lm-r-p
py_fit <- tibble(var = c("Intercept", names(mtcars)[-1]), 
           python_coef = c(linreg_python$intercept_, linreg_python$coef_))
py_fit
```
# A tibble: 11 × 2
   var       python_coef
   <chr>           <dbl>
 1 Intercept     12.3   
 2 cyl           -0.111 
 3 disp           0.0133
 4 hp            -0.0215
 5 drat           0.787 
 6 wt            -3.72  
 7 qsec           0.821 
 8 vs             0.318 
 9 am             2.52  
10 gear           0.655 
11 carb          -0.199 

Compare to R lm().

```{r}
#| label: fit-lm-r
fit <- lm(mpg ~ ., data = mtcars)
bind_cols(py_fit, tibble(R_coef = coef(fit)))
```
# A tibble: 11 × 3
   var       python_coef  R_coef
   <chr>           <dbl>   <dbl>
 1 Intercept     12.3    12.3   
 2 cyl           -0.111  -0.111 
 3 disp           0.0133  0.0133
 4 hp            -0.0215 -0.0215
 5 drat           0.787   0.787 
 6 wt            -3.72   -3.72  
 7 qsec           0.821   0.821 
 8 vs             0.318   0.318 
 9 am             2.52    2.52  
10 gear           0.655   0.655 
11 carb          -0.199  -0.199 

Check the R-squared in R.

```{r}
r_squared <- linreg_python$score(x, y) # the coefficient of Determination R^2
r_squared
```
[1] 0.8690158
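
The agreement with lm() is expected because `LinearRegression` fits ordinary least squares. As a rough illustration of what `intercept_`, `coef_`, and `score()` correspond to, here is a sketch of the same computation in plain NumPy. It uses synthetic, noiseless data with known coefficients rather than mtcars, and it is not how {scikit-learn} is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))         # synthetic predictors
y = 1.5 + X @ np.array([2.0, -3.0])  # noiseless response with known coefficients

# Prepend a column of ones so the first fitted coefficient is the intercept
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]  # analogous to intercept_ and coef_

# R^2 = 1 - (residual sum of squares) / (total sum of squares), as in score()
resid = y - design @ beta
r_squared = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(intercept, coefs, r_squared)
```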

17.8.1.1 Python Seaborn Plots in R using {reticulate}

Based on Python Seaborn Plots in R using reticulate.

The {ggplot2} package and the Python package {seaborn} offer different plot types, so it can be useful to call {seaborn} from R.

```{r}
#| label: load-reticulate-2
library(reticulate)
```

Use an R code chunk to import the requisite packages: {pandas}, {seaborn}, and the {matplotlib} module pyplot.

```{r}
#| label: load-py-pckages-2
sns <- import('seaborn')
plt <- import('matplotlib.pyplot')
pd <- import('pandas')
```

Let’s use R’s built-in AirPassengers dataset which is a time series object.

```{r}
#| label: load-ap-data
ap <- datasets::AirPassengers
plot(ap)
```

Convert the Time-Series object into an R data frame.

```{r}
#| label: convert-ap-to-df
ap1 <- data.frame(ap,
  year = trunc(time(AirPassengers)),
  month = month.abb[cycle(AirPassengers)]) |>
  tidyr::pivot_wider(names_from = month, values_from = ap) |>
  tibble::column_to_rownames(var = "year")
```
17.8.1.1.1 Build a heatmap using {seaborn}.
  • Use r_to_py() to convert the R tibble into a python object.
```{r}
#| label: heat-map-r
#| message: false
plt$close()
ap_plot <- sns$heatmap(r_to_py(ap1), fmt = "g", cmap = 'viridis')
ap_plot$set(title = "Heat Map for AirPassengers Dataset")
```
[[1]]
Text(0.5, 1.0, 'Heat Map for AirPassengers Dataset')
plt$show()

17.8.1.1.2 Build a seaborn pairplot using pairplot().
```{r}
#| label: pairplot
#| message: true
library(palmerpenguins)
data("penguins")
p_df <- r_to_py(penguins)
plt$close()
pen_plot <- sns$pairplot(p_df, hue = 'species')
```

Display the plot.

plt$show()

Note

The seaborn plots you just saw were actually the result of a workaround suggested by the reticulate team (Tomasz Kalinowski) in response to a help request I submitted in June 2023 about why the plots were not rendering properly.

Thanks for filing. This is a current limitation of the interface. Within the reticulate knitr engine, we install hooks to customize plt.show() so that the figure appears in the rendered document. When evaluating plt$show() outside the reticulate knitr engine (that is, not in a python chunk), the hooks don’t get an opportunity to run so you get the default matplotlib behavior (showing the plot in a pop-up window).

Until this is fixed, you can work around this limitation by calling plt.show() in an (invisible) python chunk.

Create the plot in an R chunk. In a new R chunk set echo: true, eval: false with plt$show().

Follow that with a python chunk with echo: false, eval: true with import matplotlib.pyplot as plt and plt.show().

Using Python in R provides you flexibility in how you execute your analysis when you want capabilities not readily available in R.

17.9 Python Example from Quarto Documentation

This example is rendered using jupyter and not knitr but it is included in the overall set of notes files as just another file to be rendered.

R does not need to be installed for it to work.

17.9.1 Sample YAML

   title: "matplotlib demo"
   format:
     html:
       code-fold: true
   jupyter: python3

For a demonstration of a line plot on a polar axis, see @fig-polar.

```{python}
#| label: fig-polar
#| fig-cap: "A line plot on a polar axis"
#| message: false

import numpy as np
import matplotlib.pyplot as plt
import warnings  # suppress FutureWarning messages triggered by a known seaborn issue
# see https://github.com/mwaskom/seaborn/issues/3486
warnings.simplefilter(action="ignore", category=FutureWarning)

r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
subplot_kw = {'projection': 'polar'}
)
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```
Figure 17.7: A line plot on a polar axis