Introduction

Many relationships, such as social connections, communication networks, and biological pathways, can be represented as networks. Network visualizations are valuable exploratory tools that can summarize large datasets and provide a concise representation of data structures before quantitative models are applied. For example, they can help us identify likely key actors, detect potential clusters or communities, and observe the overall structure of interactions, which facilitates hypothesis generation and paves the way for more in-depth quantitative modeling and analysis.

More about the data

This short tutorial shows how to use the tidygraph and ggraph libraries in R to easily create and customize network visualizations. It uses real-world, messy data from one of my collaborative research projects where we study the Twitter follower-followee connections among 4000+ state legislators in the US, comprising ~160,000 ties. (I collected follower information using Twitter’s API and transformed it into network data, where a tie exists between legislator accounts i and j if i follows j. This is a directed network.)

The tutorial shows how network visualization can reveal patterns within this follower network: clusters based on geographic, demographic and partisan affiliations, interesting properties of central nodes, and the overall structure of the connections, among other observations. While no hard conclusions should be drawn from the visualizations alone, they provide an accessible and concise summary of the data set that is easy to share with others.

Note: My laptop has 18GB of RAM, and it took about 1 minute to render each plot.

Load required libraries

library(dplyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(tidygraph)

Load data

# Read edge list and node data from RDS files
follower_edges <- readRDS("data/followers_edgelist_R1.Rds")
nodes <- readRDS("data/cleaned_nodes_R1.Rds")

Let’s look at a few rows from the edge list. The edge list has two columns, and each row represents one edge, specified by its source node and target node. If i follows j, a tie exists between them and is recorded as an edge (i->j), where i is in the follower_id column and j is in the legislator_id column.

# Show first 3 rows of the edge list dataframe
head(follower_edges, 3)
##     follower_id  legislator_id
## 1 str_963765775 str_2873254919
## 2  str_29012641 str_2873254919
## 3 str_123577910 str_2873254919
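
To make the direction of a tie concrete, here is a minimal sketch on a made-up edge list (the IDs a, b and c are hypothetical and not part of the legislator data). It assumes igraph is installed and uses are_adjacent() to check that an edge i->j does not imply an edge j->i.

```r
library(igraph)

# Toy edge list with the same column layout as follower_edges (IDs are made up)
toy_edges <- data.frame(
  follower_id   = c("a", "c"),
  legislator_id = c("b", "b")
)

# Column 1 is the source (the follower), column 2 the target (the followed account)
toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE)

a_follows_b <- are_adjacent(toy_graph, "a", "b")  # edge a->b was recorded
b_follows_a <- are_adjacent(toy_graph, "b", "a")  # no edge b->a was recorded

print(c(a_follows_b = a_follows_b, b_follows_a = b_follows_a))
```

In the toy graph, a follows b but b does not follow a, which is exactly the asymmetry a directed network keeps track of.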

Let’s see how many edges are in this network:

# Display dimensions of the edge list dataframe
dim(follower_edges)
## [1] 159346      2

That is 159,346 follower ties in total.

Let’s look at a few rows from the node data. It contains demographic (gender, race), political (party) and geographic (state, contiguity) information for each node (legislator) in the data set.

# Show first 3 rows of the node dataframe
head(nodes[,c(-2, -4)], 3)
##           str_id   state chamber party state.abb party3 index  race gender
## 1 str_2873254919 Alabama       H     R        AL      R     1 White   male
## 2 str_1089892711 Alabama       H     R        AL      R     2 White   male
## 3  str_474388304 Alabama       H     R        AL      R     3 White female
##        mds1 in_subnet
## 1 -1.430225         1
## 2 -1.430225         1
## 3 -1.430225         1

Calculate the in-degree centrality and add node label information

Here we add some additional node information that will be useful for setting some plot aesthetics later (shape, size, label):

First, we calculate the in-degree (number of incoming follower ties) for each node (legislator).

We create a directed graph object named g_follower using the graph_from_data_frame() function from the igraph package:

- d = follower_edges specifies the dataframe containing information about the edges of the graph.
- vertices = nodes specifies the dataframe containing information about the vertices (nodes) of the graph.

Then, we use the degree() function on the graph created above and add this information back to the nodes dataframe. (There are other ways to calculate the in-degree value that do not involve creating a graph.)

Second, using the in-degree measure, we identify the top 5 legislators with the highest number of followers within each state and use their state abbreviations as the labels in the node dataframe, setting the remaining labels to NA. (The number 5 is arbitrary; the goal is to avoid too many overlapping labels in the dense network while retaining useful information to identify state clusters, if any are present, and to see how those with the most followers in a state are positioned in the network.)

# Create directed graph object from edge list and node data
g_follower <- graph_from_data_frame(d = follower_edges, vertices = nodes, directed = TRUE)

# Calculate in-degree for each node (number of incoming follower ties)
V(g_follower)$indegree <- degree(g_follower, mode = 'in', loops = FALSE)
nodes$follower_indegree <- V(g_follower)$indegree

# Find the top 5 most central nodes within each state and assign labels
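top_5_follower <- nodes %>% 
  group_by(state) %>% 
  slice_max(follower_indegree, n = 5) %>%  # Keeps ties (like top_n), so a state can get more than 5 labels
  ungroup() %>%  # Drop the grouping so later operations are not applied per state
  mutate(follower_labels = state.abb)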

# Join labels back into nodes dataframe
nodes <- nodes %>% 
  left_join(top_5_follower[c('str_id','follower_labels')], by='str_id')

# Display first 3 rows of the nodes dataframe
head(nodes[,c(10:15)], 3)
##    race gender      mds1 in_subnet follower_indegree follower_labels
## 1 White   male -1.430225         1                25            <NA>
## 2 White   male -1.430225         1                22            <NA>
## 3 White female -1.430225         1                40              AL
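
As mentioned above, the in-degree can also be calculated without creating a graph. One possible way, sketched here on a made-up edge list (the IDs and the toy_ object names are hypothetical), is to count how often each account appears in the legislator_id column with dplyr. Self-follows are dropped first to mirror loops = FALSE, and nodes that never appear as a target get an in-degree of 0.

```r
library(dplyr)

# Toy data using the tutorial's column names (the IDs are made up)
toy_edges <- data.frame(
  follower_id   = c("a", "c", "b", "d"),
  legislator_id = c("b", "b", "a", "b")
)
toy_nodes <- data.frame(str_id = c("a", "b", "c", "d", "e"))

# Count incoming ties per target account, ignoring self-follows
toy_indegree <- toy_edges %>%
  filter(follower_id != legislator_id) %>%
  count(legislator_id, name = "follower_indegree")

# Join the counts back to the node table; accounts nobody follows get 0
toy_nodes <- toy_nodes %>%
  left_join(toy_indegree, by = c("str_id" = "legislator_id")) %>%
  mutate(follower_indegree = coalesce(follower_indegree, 0L))

print(toy_nodes)
```

Here b ends up with an in-degree of 3, a with 1, and the remaining accounts with 0.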

Let’s also create an edge attribute that identifies whether a tie is with another legislator from the same state or a different state.

follower_edges <- follower_edges %>%
  left_join(nodes %>% select(str_id, state.abb), by=c("follower_id"="str_id")) %>%
  left_join(nodes %>% select(str_id, state.abb), by=c("legislator_id"="str_id")) %>%
  mutate(cross_state_tie = ifelse(state.abb.x==state.abb.y, "same state", "cross state")) %>%
  select(-state.abb.x, -state.abb.y)

head(follower_edges, 3)
##     follower_id  legislator_id cross_state_tie
## 1 str_963765775 str_2873254919      same state
## 2  str_29012641 str_2873254919      same state
## 3 str_123577910 str_2873254919      same state

Some descriptive information

The in-degree distribution looks right-skewed, meaning a few legislators attract a disproportionately large number of followers compared to the rest.

# Plot indegree distribution 
ggplot(nodes, aes(x = follower_indegree)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(follower_indegree)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Follower Indegree",
       x = "Follower Indegree",
       y = "Frequency") +
  theme_minimal()
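
A quick numeric companion to the histogram is to compare the mean and the median: in a right-skewed distribution, a few very large values pull the mean above the median. The sketch below uses a made-up vector of in-degree values purely for illustration; with the data loaded, nodes$follower_indegree could be used in its place.

```r
# Made-up in-degree values with one very popular account
toy_indegree <- c(1, 2, 2, 3, 3, 4, 50)

# The large value pulls the mean up, while the median stays put
skew_check <- c(mean = mean(toy_indegree), median = median(toy_indegree))
print(round(skew_check, 2))  # mean 9.29, median 3
```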

We have 3 values for party affiliation:

table(nodes$party3) 
## 
##    D    I    R 
## 2244   11 1853

2 categories for the chamber the legislators belong to:

table(nodes$chamber)
## 
##    H    S 
## 2938 1170

7 categories for race:

table(nodes$race)
## 
## Asian or Pacific Islander                     Black                    Latino 
##                        87                       467                       217 
##                      MENA               Multiracial           Native American 
##                        14                        20                        18 
##                     White 
##                      3285

2 categories for gender:

table(nodes$gender)
## 
## female   male 
##   1400   2708

50 states:

table(nodes$state.abb)
## 
##  AK  AL  AR  AZ  CA  CO  CT  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY  LA  MA  MD 
##  24  53  77  78 115  83  83  21 132 135  25  74  31 116  74  71  87  78 157 127 
##  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK  OR  PA  RI  SC 
##  44  93 147 126  74  47 118  28  22 137  82  46  55 167 100  80  53 171  67 101 
##  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##  23  83 156  63 104  30  76  96  57  21
length(table(nodes$state.abb))
## [1] 50

Number of same and cross state ties:

table(follower_edges$cross_state_tie)
## 
## cross state  same state 
##       10236      149110
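
To put these two counts in proportion, here is a small piece of arithmetic with prop.table(). The counts are typed in by hand from the output above; with the data loaded, table(follower_edges$cross_state_tie) could be passed in directly.

```r
# Tie counts copied from the table above
tie_counts <- c("cross state" = 10236, "same state" = 149110)

# Share of each tie type among all 159,346 ties
tie_share <- round(prop.table(tie_counts), 3)
print(tie_share)  # cross state 0.064, same state 0.936
```

So roughly 94% of follower ties stay within a state.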

It looks like the 10 legislators with the most followers are mostly Democrats (9 of 10), come from large states, and are mostly male and white.

# Properties of legislators with the top 10 highest in-degree value
nodes %>%
  arrange(desc(follower_indegree)) %>%
  head(10) %>%
  select(party, chamber, gender, race, state)
##    party chamber gender   race         state
## 1      D       H female  White      Virginia
## 2      D       S   male  White      New York
## 3      D       H   male  White Massachusetts
## 4      D       S   male  White Massachusetts
## 5      D       H female  White      New York
## 6      D       H female Latino         Texas
## 7      D       H   male  Black       Florida
## 8      D       H   male Latino         Texas
## 9      R       H   male  White         Texas
## 10     D       H   male  Black      New York

Create the graph object and set the layout

Now, using the updated node information, let’s recreate the graph, this time using the tbl_graph() function from the tidygraph package. The tidygraph package provides a tidy data structure for graph and network data, so we can manipulate and analyze graphs with the same principles as tidyverse packages like dplyr and ggplot2, which is very nice and keeps the code neat.

# Recreate the graph and create a tidygraph object
g_follower_tidy <- tbl_graph(
    nodes = nodes,
    edges = follower_edges,
    node_key = "str_id") %>%
  activate(nodes) %>%  # Sets context to nodes -> subsequent operations are performed on nodes
  filter(!node_is_isolated())  # Removes nodes that are isolated/do not have any follower edges
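
To illustrate what working with a graph in tidyverse style looks like, here is a minimal sketch on a made-up four-account network (the IDs and the toy_ object names are hypothetical). It uses the same tbl_graph(), activate() and filter() steps as above, then adds an in-degree column with mutate() and tidygraph's centrality_degree().

```r
library(dplyr)
library(tidygraph)

# Toy node and edge tables using the tutorial's column names (IDs are made up)
toy_nodes <- data.frame(str_id = c("a", "b", "c", "d"))
toy_edges <- data.frame(
  follower_id   = c("a", "c", "b"),
  legislator_id = c("b", "b", "a")
)

toy_tidy <- tbl_graph(nodes = toy_nodes, edges = toy_edges, node_key = "str_id") %>%
  activate(nodes) %>%                                  # Work on the node table
  filter(!node_is_isolated()) %>%                      # Drops "d", which has no ties
  mutate(indegree = centrality_degree(mode = "in"))    # Incoming ties per node

# Pull the active (node) table out as a regular tibble
toy_result <- as_tibble(toy_tidy)
print(toy_result)
```

The isolated account d is removed, and b gets an in-degree of 2 because both a and c follow it.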

Next we use the create_layout function from the ggraph package, which defines how the nodes and edges should be arranged in the plot. The ggraph package is an extension of ggplot2 specifically designed for creating network visualizations. We use the Fruchterman-Reingold algorithm (“fr”) to set the layout of this graph.

The Fruchterman-Reingold layout algorithm (Fruchterman T.M.J., Reingold E.M. 1991) is a force-directed layout algorithm commonly used for visualizing graphs. In short, it treats the graph as a physical system in which nodes behave like electrically charged particles that repel each other, while edges act like springs that pull connected nodes together; the algorithm iteratively moves nodes to reduce the energy of this system. Note that calculating forces for all pairs of nodes can be computationally expensive, especially for large graphs, so the layout can take some time to compute.

# Set seed for layout reproducibility
set.seed(10)
# Create layout using the Fruchterman-Reingold algorithm from igraph
follower_layout <- create_layout(g_follower_tidy, layout = "igraph", algorithm = "fr")
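
To see what the layout step produces and why the seed matters, here is a small sketch that calls igraph's layout_with_fr() directly on a made-up 10-node ring (not the legislator network). The algorithm starts from random positions, so fixing the seed before each call should give the same coordinates each time.

```r
library(igraph)

# Toy graph: 10 nodes connected in a ring
toy_graph <- make_ring(10)

# Same seed before each call, so the random starting positions match
set.seed(10)
layout_a <- layout_with_fr(toy_graph)
set.seed(10)
layout_b <- layout_with_fr(toy_graph)

print(dim(layout_a))                         # One row per node, columns are x and y
print(isTRUE(all.equal(layout_a, layout_b))) # Are the two layouts the same?
```

The result is simply a matrix of x and y coordinates, one row per node; create_layout() wraps these coordinates together with the node data so that ggraph can plot them.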

Create the Visualization

Let’s start by visualizing the basic graph structure without incorporating any additional information on other variables.

# Plot the basic graph structure with default settings
ggraph(follower_layout) +
  geom_edge_link() +
  geom_node_point()

Okayyy… Let’s reduce the opacity of the edges using the alpha option to see if we can make the nodes visible.

# Plot the graph structure with reduced edge intensity (alpha)

ggraph(follower_layout) +
  geom_edge_link(alpha=.01) + # Reduce edge intensity using alpha
  geom_node_point()

That worked really well!

Next, let’s add information about the party and chamber affiliation of each node (legislator) using the color and shape options. Let’s color the nodes based on party (Democrat, Republican, Independent) and assign node shapes based on the legislator’s chamber (House or Senate). Note that default colors and shapes will be chosen if not explicitly provided:

# This code adds color and shape aesthetics to represent the party and chamber information of each node.

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(
    aes(
      color = party3, # Color nodes based on party affiliation (D, R or I)
      shape = chamber # Shape nodes based on chamber (House or Senate)
    )
  ) 
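
If the default shapes are not what you want, they can be set explicitly in the same way as any other ggplot2 scale. The sketch below uses a made-up four-node graph and scale_shape_manual(); the shape codes 16 (filled circle) and 17 (filled triangle) are just one possible choice, not the ones used in the plots in this tutorial.

```r
library(ggplot2)
library(tidygraph)
library(ggraph)

# Toy graph with a chamber attribute on each node (names are made up)
toy_graph <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c", "d"),
                     chamber = c("H", "S", "H", "S")),
  edges = data.frame(from = c(1, 2, 3), to = c(2, 3, 4))
)

set.seed(10)
toy_plot <- ggraph(toy_graph, layout = "igraph", algorithm = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(shape = chamber), size = 4) +
  scale_shape_manual("Chamber",
                     values = c(H = 16,   # Filled circle for House
                                S = 17))  # Filled triangle for Senate

print(toy_plot)
```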

Finally, let’s adjust the size and transparency of the nodes based on their follower in-degree (higher values = bigger, more opaque nodes) and add state labels to the top 5 nodes in each state with the highest in-degree. Let’s also customize the node colors and the legend titles.

# Plot the graph with additional aesthetics for color, shape, size, and labels

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = party3, # Color nodes based on party affiliation (D, R or I)
                      shape = chamber, # Shape nodes based on chamber (House or Senate)
                      alpha = follower_indegree, # Adjust node transparency based on follower indegree
                      size = follower_indegree # Adjust node size based on follower indegree
                      )) +
  scale_color_manual("Party",
                     values = c(D = "dodgerblue", # Assign color for Democrat
                                R = "firebrick2", # Assign color for Republican
                                I = "yellow")) + # Assign color for Independent
  geom_node_text(aes(label = follower_labels,
                     size = follower_indegree/3
                     )) +
  theme_graph(base_family = 'Helvetica') +
  guides(
    alpha = guide_legend(title="In-degree"),  # Customizing Legend labels 
    color = guide_legend(title = "Party"),
    shape = guide_legend(title = "Chamber"),
    size = guide_legend(title = "In-degree")
  )
## Warning: Removed 3782 rows containing missing values or values outside the scale range
## (`geom_text()`).
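
The warning above most likely comes from the nodes whose follower_labels value is NA, which by design is every node outside the per-state top 5. It is harmless here. One possible way to avoid passing those rows to the text layer, sketched on a made-up three-node graph, is the filter aesthetic that ggraph's node geoms accept. This is an optional variation and not the code used for the plot above.

```r
library(ggplot2)
library(tidygraph)
library(ggraph)

# Toy graph where only one node carries a label (names and labels are made up)
toy_graph <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c"),
                     follower_labels = c("AL", NA, NA)),
  edges = data.frame(from = c(2, 3), to = c(1, 1))
)

set.seed(10)
toy_plot <- ggraph(toy_graph, layout = "igraph", algorithm = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = follower_labels,
                     filter = !is.na(follower_labels)))  # Keep only labeled nodes

print(toy_plot)
```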

That looks good! With a few easy steps, we were able to create an impactful visualization that highlights interesting properties of this data.

Some Observations

More plots

We could have also set aesthetics based on different variables, like coloring the nodes by state instead of the party variable to highlight state clusters more prominently:

# Plot the graph where nodes are colored by state 

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = state, # Color nodes based on state this time
                      shape = chamber)) + # Shape nodes based on chamber (House or Senate)
  theme_graph()

Or by gender:

# Plot the graph where nodes are colored by gender

ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = gender, # Color nodes based on gender this time 
                      shape = chamber # Shape nodes based on chamber (House or Senate)
                      )) + 
  theme_graph()

We can also color the edges by the edge type to mark if the tie is a same or cross state follower tie.

ggraph(follower_layout) +
  geom_edge_link(aes(color = cross_state_tie), alpha = 0.02) +  # Color ties based on tie type (Cross and Same state tie)
  geom_node_point(aes(color = party3, 
                      shape = chamber)) +
  scale_color_manual("Party",
                     values = c(D = "dodgerblue", # Assign color for Democrat
                                R = "firebrick2", # Assign color for Republican
                                I = "yellow")) +  # Assign color for Independent
  scale_edge_color_manual(values = c("same state" = "black", "cross state" = "darkgreen")) + # Set custom colors for same state and cross state ties
  theme_graph()