tidygraph and ggraph
Many relationships, such as social connections, communication networks, and biological pathways, can be represented as networks. Network visualizations are valuable exploratory tools that can summarize large datasets and provide a concise representation of data structures before quantitative models are applied. For example, they can help us identify likely key actors, detect potential clusters or communities, and observe the overall structure of interactions, facilitating hypothesis generation and paving the way for more in-depth quantitative modeling and analysis.
This short tutorial shows how to use the tidygraph and ggraph
libraries in R to easily create and customize
network visualizations. It uses real-world, messy data from one of my
collaborative research projects, where we study the Twitter
follower-followee connections among 4000+ state legislators in the US,
comprising ~160,000 ties. (I collected follower information using
Twitter’s API and transformed it into network data, where a tie exists
between legislator accounts i and j if i follows j. This is a directed
network.)
The tutorial shows how network visualization can reveal patterns within this follower network: it helps identify clusters based on geographic, demographic, and partisan affiliations, pinpoint some interesting properties of central nodes, and recognize the overall structure of the connections, among other observations. While no hard conclusions should be drawn from the visualizations alone, they provide an accessible and concise summary of the data set that is easy to share with others.
Note: My laptop has 18GB of RAM, and it took about 1 minute to render each plot.
library(dplyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(tidygraph)
# Read edge list and node data from RDS files
follower_edges <- readRDS("data/followers_edgelist_R1.Rds")
nodes <- readRDS("data/cleaned_nodes_R1.Rds")
Let’s look at a few rows from the edge list. The edge list comprises
two columns, and each row represents an edge specified by its source
node and target node. If i follows j, a tie exists between them and is
recorded as an edge (i->j), where i is under the column name
follower_id and j is under the column name legislator_id.
# Show first 3 rows of the edge list dataframe
head(follower_edges, 3)
## follower_id legislator_id
## 1 str_963765775 str_2873254919
## 2 str_29012641 str_2873254919
## 3 str_123577910 str_2873254919
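To make the direction convention concrete, here is a minimal sketch on a made-up three-row edge list. The ids a, b, c and the object names toy_edges and toy_graph are hypothetical and not part of the legislator data. It assumes the igraph package and checks that an edge recorded as follower_id -> legislator_id runs only in that direction.

```r
library(igraph)

# Hypothetical toy edge list using the same column names as follower_edges
toy_edges <- data.frame(
  follower_id   = c("a", "b", "c"),
  legislator_id = c("b", "c", "b")
)

# The first column is read as the source node and the second as the target node
toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE)

# a follows b, so the edge a -> b exists, but b -> a does not
a_follows_b <- are_adjacent(toy_graph, "a", "b")
b_follows_a <- are_adjacent(toy_graph, "b", "a")

# In-degree counts incoming ties, i.e. followers within the network
toy_indegree <- degree(toy_graph, mode = "in")
```

Because the graph is directed, a tie from a to b says nothing about whether b follows a back.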
Let’s see how many edges are in this network:
# Display dimensions of the edge list dataframe
dim(follower_edges)
## [1] 159346 2
Let’s look at a few rows from the node data. It has demographic (gender, race), political (party), and geographic (state, contiguity) information associated with each node (legislator) in the data set.
# Show first 3 rows of the node dataframe
head(nodes[,c(-2, -4)], 3)
## str_id state chamber party state.abb party3 index race gender
## 1 str_2873254919 Alabama H R AL R 1 White male
## 2 str_1089892711 Alabama H R AL R 2 White male
## 3 str_474388304 Alabama H R AL R 3 White female
## mds1 in_subnet
## 1 -1.430225 1
## 2 -1.430225 1
## 3 -1.430225 1
Here we add some additional node information that will be useful for setting some plot aesthetics later (shape, size, label):
First, we calculate the in-degree (number of incoming follower ties) for each node (legislator).
We create a directed graph object named g_follower using the graph_from_data_frame() function from the igraph package:
- d = follower_edges specifies the dataframe containing information about the edges of the graph.
- vertices = nodes specifies the dataframe containing information about the vertices (nodes) of the graph.
Then, we use the degree() function on the graph created
above and add this information back to the nodes dataframe. (There are
other ways to calculate the in-degree value that do not involve creating
a graph.)
Second, using the in-degree measure, we identify the top 5 legislators with the highest number of followers within each state and use their state abbreviations as the labels in the node dataframe, setting the remaining labels to NA. (The number 5 is arbitrary; the goal is to avoid too many overlapping labels in the dense network while retaining useful information to identify state clusters, if any are present, and to see how those with the most followers in a state are positioned in the network.)
# Create directed graph object from edge list and node data
g_follower <- graph_from_data_frame(d = follower_edges, vertices = nodes, directed = TRUE)
# Calculate in-degree for each node (number of incoming follower ties)
V(g_follower)$indegree <- degree(g_follower, mode = 'in', loops = FALSE)
nodes$follower_indegree <- V(g_follower)$indegree
# Find the top 5 most central nodes within each state and assign labels
# (ties in in-degree can yield more than 5 rows for a state)
top_5_follower <- nodes %>%
  group_by(state) %>%
  top_n(5, wt = follower_indegree) %>%
  mutate(follower_labels = state.abb) %>%
  ungroup()
# Join labels back into nodes dataframe
nodes <- nodes %>%
  left_join(top_5_follower[c('str_id', 'follower_labels')], by = 'str_id')
# Display first 3 rows of the nodes dataframe
head(nodes[,c(10:15)], 3)
## race gender mds1 in_subnet follower_indegree follower_labels
## 1 White male -1.430225 1 25 <NA>
## 2 White male -1.430225 1 22 <NA>
## 3 White female -1.430225 1 40 AL
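As mentioned above, the in-degree can also be computed without building a graph object. The following is a minimal base R sketch on made-up data; toy_edges, toy_nodes, and the ids a to d are hypothetical. It simply counts how often each id appears in the target column, after dropping self-ties to mirror loops = FALSE.

```r
# Hypothetical toy data with the same column names as the tutorial's dataframes
toy_edges <- data.frame(
  follower_id   = c("a", "b", "c", "a", "c"),
  legislator_id = c("b", "c", "b", "c", "a"),
  stringsAsFactors = FALSE
)
toy_nodes <- data.frame(str_id = c("a", "b", "c", "d"), stringsAsFactors = FALSE)

# Drop self-ties, mirroring loops = FALSE in degree()
no_loops <- toy_edges[toy_edges$follower_id != toy_edges$legislator_id, ]

# Count incoming ties per node; using factor levels keeps nodes with zero followers
indegree_counts <- table(factor(no_loops$legislator_id, levels = toy_nodes$str_id))
toy_nodes$follower_indegree <- as.integer(indegree_counts[toy_nodes$str_id])
```

Node d never appears as a target, so it gets an in-degree of 0 rather than being dropped, which matters when the counts are joined back onto the full node list.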
Let’s also create an edge attribute that identifies whether a tie is with another legislator from the same state or a different state.
follower_edges <- follower_edges %>%
  left_join(nodes %>% select(str_id, state.abb), by = c("follower_id" = "str_id")) %>%
  left_join(nodes %>% select(str_id, state.abb), by = c("legislator_id" = "str_id")) %>%
  mutate(cross_state_tie = ifelse(state.abb.x == state.abb.y, "same state", "cross state")) %>%
  select(-state.abb.x, -state.abb.y)
head(follower_edges, 3)
## follower_id legislator_id cross_state_tie
## 1 str_963765775 str_2873254919 same state
## 2 str_29012641 str_2873254919 same state
## 3 str_123577910 str_2873254919 same state
The in-degree distribution looks right-skewed, meaning some legislators attract a disproportionately large number of followers compared to the rest.
# Plot indegree distribution
ggplot(nodes, aes(x = follower_indegree)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_vline(aes(xintercept = mean(follower_indegree)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Follower Indegree",
       x = "Follower Indegree",
       y = "Frequency") +
  theme_minimal()
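A rough numeric companion to the histogram: in a right-skewed distribution, the long upper tail typically pulls the mean above the median. The sketch below uses a made-up vector (toy_counts is hypothetical), not the legislator data; the same two summaries could be computed on nodes$follower_indegree.

```r
# Hypothetical in-degree values with a long upper tail
toy_counts <- c(2, 3, 3, 4, 5, 5, 6, 8, 40, 120)

toy_mean <- mean(toy_counts)      # pulled upward by the two large values
toy_median <- median(toy_counts)  # robust to the tail

# A mean well above the median is consistent with right skew
mean_exceeds_median <- toy_mean > toy_median
```

This is only a heuristic check; the histogram remains the more informative view of the shape.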
We have 3 values for party affiliation:
table(nodes$party3)
##
## D I R
## 2244 11 1853
2 categories for the chamber the legislators belong to:
table(nodes$chamber)
##
## H S
## 2938 1170
7 categories for race:
table(nodes$race)
##
## Asian or Pacific Islander Black Latino
## 87 467 217
## MENA Multiracial Native American
## 14 20 18
## White
## 3285
2 categories for gender:
table(nodes$gender)
##
## female male
## 1400 2708
50 states:
table(nodes$state.abb)
##
## AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD
## 24 53 77 78 115 83 83 21 132 135 25 74 31 116 74 71 87 78 157 127
## ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC
## 44 93 147 126 74 47 118 28 22 137 82 46 55 167 100 80 53 171 67 101
## SD TN TX UT VA VT WA WI WV WY
## 23 83 156 63 104 30 76 96 57 21
length(table(nodes$state.abb))
## [1] 50
Number of same and cross state ties:
table(follower_edges$cross_state_tie)
##
## cross state same state
## 10236 149110
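To put these counts in proportion, the short sketch below converts them into percentages. It hard-codes the two counts printed above into a named vector (tie_counts is a name introduced here for illustration); on the real data, prop.table(table(follower_edges$cross_state_tie)) would give the same shares.

```r
# Counts copied from the table output above
tie_counts <- c("cross state" = 10236, "same state" = 149110)

# Share of each tie type as a percentage of all ties
tie_share <- round(100 * tie_counts / sum(tie_counts), 1)
tie_share
```

Roughly 6.4% of the ties cross state lines, so the network is dominated by within-state following.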
It looks like the 10 legislators with the highest number of followers are almost all Democrats (9 of 10), belong to large states, and are mostly male and White.
# Properties of legislators with the top 10 highest in-degree value
nodes %>%
  arrange(desc(follower_indegree)) %>%
  head(10) %>%
  select(party, chamber, gender, race, state)
## party chamber gender race state
## 1 D H female White Virginia
## 2 D S male White New York
## 3 D H male White Massachusetts
## 4 D S male White Massachusetts
## 5 D H female White New York
## 6 D H female Latino Texas
## 7 D H male Black Florida
## 8 D H male Latino Texas
## 9 R H male White Texas
## 10 D H male Black New York
Now, using the updated node information, let’s recreate the graph, but
this time using the tbl_graph() function from the tidygraph package.
The tidygraph package is
designed to provide a tidy data structure for graph and network data,
allowing us to manipulate and analyze graphs using the same principles
as the tidyverse packages, like dplyr and ggplot2, which is very nice and
keeps the code neat.
# Recreate the graph and create a tidygraph object
g_follower_tidy <- tbl_graph(
  nodes = nodes,
  edges = follower_edges,
  node_key = "str_id") %>%
  activate(nodes) %>%          # Sets context to nodes -> subsequent operations are performed on nodes
  filter(!node_is_isolated())  # Removes nodes that are isolated/do not have any follower edges
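To see what the activate() and filter() steps do, here is a small sketch on made-up data. The objects toy_nodes, toy_edges, toy_graph, and toy_connected are hypothetical, and node d is deliberately left without any ties. It assumes the dplyr, tidygraph, and igraph packages are installed.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: node "d" has no incoming or outgoing ties
toy_nodes <- data.frame(str_id = c("a", "b", "c", "d"),
                        party3 = c("D", "R", "D", "I"))
toy_edges <- data.frame(follower_id   = c("a", "b", "c"),
                        legislator_id = c("b", "c", "a"))

# node_key tells tbl_graph() which node column the edge endpoints refer to
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, node_key = "str_id")

# With nodes activated, dplyr verbs operate on the node table
toy_connected <- toy_graph %>%
  activate(nodes) %>%
  filter(!node_is_isolated())

n_before <- igraph::gorder(toy_graph)      # number of nodes before filtering
n_after  <- igraph::gorder(toy_connected)  # isolated node "d" is gone
```

Filtering nodes this way keeps the edge table consistent automatically, which is one of the conveniences of working with a tbl_graph rather than two separate dataframes.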
Next, we use the create_layout() function from the ggraph package,
which defines how the nodes and edges
should be arranged in the plot. The ggraph package is an extension of
ggplot2 specifically designed for creating network visualizations. We
use the Fruchterman-Reingold algorithm (“fr”) to set the layout of this
graph.
The Fruchterman-Reingold layout algorithm (Fruchterman T.M.J., Reingold E.M. 1991) is a force-directed layout algorithm commonly used for visualizing graphs. In short, it treats the graph as a physical system in which nodes are conceptualized as electrically charged particles that repel each other, while edges act like springs that pull connected nodes together; the basic idea is to minimize the energy of this system. Note that calculating the forces for all pairs of nodes can be computationally expensive, especially for large graphs, so it can take some time to render the visualization.
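The idea can be sketched in a few lines of base R. This is a simplified illustration of the force-directed scheme, not igraph's implementation: the function name fr_sketch, its parameters, the starting temperature, and the cooling rate are all assumptions chosen for readability, and the frame-boundary handling from the original paper is left out. Repulsion between every pair of nodes has magnitude k^2/d and attraction along each edge has magnitude d^2/k, where d is the distance between two nodes and k is the ideal spacing.

```r
# Simplified Fruchterman-Reingold sketch (illustrative; not igraph's code)
# edges: two-column matrix of node indices; n: number of nodes
fr_sketch <- function(edges, n, iterations = 50, area = 1) {
  pos <- matrix(runif(2 * n, -0.5, 0.5), ncol = 2)  # random starting positions
  k <- sqrt(area / n)                               # ideal distance between nodes
  temp <- 0.1                                       # maximum movement per iteration
  for (it in seq_len(iterations)) {
    disp <- matrix(0, nrow = n, ncol = 2)
    # Repulsion between every pair of nodes (the costly all-pairs step)
    for (v in seq_len(n)) {
      for (u in seq_len(n)) {
        if (u != v) {
          delta <- pos[v, ] - pos[u, ]
          d <- max(sqrt(sum(delta^2)), 1e-9)
          disp[v, ] <- disp[v, ] + (delta / d) * (k^2 / d)
        }
      }
    }
    # Attraction along each edge pulls its two endpoints together
    for (e in seq_len(nrow(edges))) {
      v <- edges[e, 1]
      u <- edges[e, 2]
      delta <- pos[v, ] - pos[u, ]
      d <- max(sqrt(sum(delta^2)), 1e-9)
      pull <- (delta / d) * (d^2 / k)
      disp[v, ] <- disp[v, ] - pull
      disp[u, ] <- disp[u, ] + pull
    }
    # Move each node along its net force, capped by the current temperature
    len <- pmax(sqrt(rowSums(disp^2)), 1e-9)
    pos <- pos + (disp / len) * pmin(len, temp)
    temp <- temp * 0.95  # cool down so the layout settles
  }
  pos
}

# Hypothetical toy graph: a triangle (1, 2, 3) plus a separate pair (4, 5)
set.seed(10)
toy_edge_matrix <- rbind(c(1, 2), c(2, 3), c(3, 1), c(4, 5))
toy_pos <- fr_sketch(toy_edge_matrix, n = 5)
```

The nested loop over all node pairs is what makes the repulsion step quadratic in the number of nodes, which is why layouts for networks with thousands of nodes take noticeable time.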
# Set seed for layout reproducibility
set.seed(10)
# Create layout using the Fruchterman-Reingold algorithm from igraph
follower_layout <- create_layout(g_follower_tidy, layout = "igraph", algorithm = "fr")
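It can help to look at what create_layout() returns. The sketch below runs it on a made-up three-node graph (toy_nodes, toy_edges, toy_graph, and toy_layout are hypothetical names) and assumes the tidygraph and ggraph packages. The result is a dataframe with one row per node, holding the computed x and y coordinates alongside the original node attributes.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph with the same column names as the tutorial data
toy_nodes <- data.frame(str_id = c("a", "b", "c"),
                        party3 = c("D", "R", "D"))
toy_edges <- data.frame(follower_id   = c("a", "b", "c"),
                        legislator_id = c("b", "c", "a"))
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, node_key = "str_id")

# The layout starts from random positions, so set a seed first
set.seed(10)
toy_layout <- create_layout(toy_graph, layout = "igraph", algorithm = "fr")

# One row per node: coordinates plus the node attributes
head(toy_layout)
```

Computing the layout once and reusing the object means every subsequent plot shares the same node positions, which makes the plots directly comparable.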
Let’s start by visualizing the basic graph structure without incorporating any additional information on other variables.
# Plot the basic graph structure with default settings
ggraph(follower_layout) +
geom_edge_link() +
geom_node_point()
Okayyy… Let’s reduce the opacity of the edges using the alpha option to see if we can make the nodes visible.
# Plot the graph structure with reduced edge intensity (alpha)
ggraph(follower_layout) +
geom_edge_link(alpha=.01) + # Reduce edge intensity using alpha
geom_node_point()
That worked really well!
Next, let’s add information about the party and chamber affiliation of each node (legislator) using the color and shape options. Let’s color the nodes based on party (Democrat, Republican, Independent) and assign node shapes based on the legislator’s chamber (House or Senate). Note that default colors and shapes are chosen if none are explicitly provided:
# This code adds color and shape aesthetics to represent the party and chamber information of each node.
ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(
    aes(
      color = party3,  # Color nodes based on party affiliation (D, R or I)
      shape = chamber  # Shape nodes based on chamber (House or Senate)
    )
  )
Finally, let’s adjust the size and transparency of the nodes based on their follower in-degree (higher values = bigger, more opaque nodes) and add state labels to the top 5 nodes in each state with the highest in-degree. Let’s also customize the color of the nodes and the legend labels.
# Plot the graph with additional aesthetics for color, shape, size, and labels
ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = party3,             # Color nodes based on party affiliation (D, R or I)
                      shape = chamber,            # Shape nodes based on chamber (House or Senate)
                      alpha = follower_indegree,  # Adjust node transparency based on follower indegree
                      size = follower_indegree    # Adjust node size based on follower indegree
  )) +
  scale_color_manual("Party",
                     values = c(D = "dodgerblue",   # Assign color for Democrat
                                R = "firebrick2",   # Assign color for Republican
                                I = "yellow")) +    # Assign color for Independent
  geom_node_text(aes(label = follower_labels,
                     size = follower_indegree/3
  )) +
  theme_graph(base_family = 'Helvetica') +
  guides(
    alpha = guide_legend(title = "In-degree"),  # Customizing legend labels
    color = guide_legend(title = "Party"),
    shape = guide_legend(title = "Chamber"),
    size = guide_legend(title = "In-degree")
  )
## Warning: Removed 3782 rows containing missing values or values outside the scale range
## (`geom_text()`).
That looks good! With a few easy steps, we were able to create an impactful visualization that highlights interesting properties of this data.
We could have also set aesthetics based on different variables, like coloring the nodes by state instead of the party variable to highlight state clusters more prominently:
# Plot the graph where nodes are colored by state
ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = state,       # Color nodes based on state this time
                      shape = chamber)) +  # Shape nodes based on chamber (House or Senate)
  theme_graph()
Or by gender:
# Plot the graph where nodes are colored by gender
ggraph(follower_layout) +
  geom_edge_link(alpha = 0.01) +
  geom_node_point(aes(color = gender,  # Color nodes based on gender this time
                      shape = chamber  # Shape nodes based on chamber (House or Senate)
  )) +
  theme_graph()
We can also color the edges by edge type to mark whether a tie is a same-state or cross-state follower tie.
ggraph(follower_layout) +
  geom_edge_link(aes(color = cross_state_tie), alpha = 0.02) +  # Color ties based on tie type (cross or same state tie)
  geom_node_point(aes(color = party3,
                      shape = chamber)) +
  scale_color_manual("Party",
                     values = c(D = "dodgerblue",   # Assign color for Democrat
                                R = "firebrick2",   # Assign color for Republican
                                I = "yellow")) +    # Assign color for Independent
  scale_edge_color_manual(values = c("same state" = "black", "cross state" = "darkgreen")) +  # Set custom colors for same state and cross state ties
  theme_graph()