Hurricane Ida Outage on Twitter

Author: Joseph Holler, Elise Chan

Created: Fall 2019

Updated: Fall 2023

This analysis is based on a previous study of Hurricane Dorian: https://www.github.com/gis4dev/OR-Dorian. The original analysis was developed with assistance from:

Abstract

This study replicates Professor Holler’s study of Twitter activity during Hurricane Dorian in 2019, instead analyzing Twitter activity during Hurricane Ida in 2021. To assess Twitter activity, the study uses tweets containing keywords related to the hurricane, posted during the event in the eastern United States. Tweets are analyzed for temporal distribution, networks, text content, and spatial distribution. Holler (2019) developed a Normalized Difference Tweet Index (NDTI) to measure the relative abundance of tweets about a hurricane event in each county, normalizing by the number of non-hurricane tweets in each county. Because changes in Twitter’s API availability prevented new Twitter data from being collected, this replication uses county population to calculate the tweet rate for the NDTI. To identify hotspots of Twitter activity, the study uses the Getis-Ord G* method. Additional deviations from Holler (2019) include visualizing the geographic area of the search queries, using all counties that fall within the search area, and improving the visualizations of the analyses.

Set up environment

Load the R project saved in the root directory of this repository, so that the working directory is the root directory of the repository.

# list of required packages
packages <- c(
  "here", "svDialogs", "tidyverse",
  "tidytext", "tm", "igraph", "ggraph",
  "tidycensus", "sf", "spdep"
)

if(!require(groundhog)){
  install.packages("groundhog")
  require(groundhog)
}

if(!require(here)){
  install.packages("here")
  require(here)
}

# use package versions as of 2021-09-10 via groundhog
groundhog.day <- "2021-09-10"
set.groundhog.folder(here("data", "scratch", "groundhog"))

groundhog.library(packages, groundhog.day)

# load and install required packages
package_check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE, quietly = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

# Load the 'remotes' package
#install.packages("remotes")
library(remotes)

# install a previous version of rtweet
# in 2021, would have been version 0.7.0.

# detach("package:rtweet", unload = TRUE) # run this line if a current version of rtweet is already installed
library(devtools)
install_version("rtweet", version = "0.7.0",
                repos = "https://cran.r-project.org")
library(rtweet)

# save the R processing environment
# writeLines(
#   capture.output(sessionInfo()),
#   here("procedure", "environment", "r_environment.txt")
# )
# load and install required packages

packages <- c(
  "here", "svDialogs", "tidyverse",
  "tidytext", "tm", "igraph", "ggraph",
  "tidycensus", "sf", "spdep",
  "remotes", "devtools"
)

package_check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE, quietly = TRUE)
      library(x, character.only = TRUE)
    }
  }
)
## Loading required package: here
## here() starts at C:/Users/elisec/Documents/GitHub/Rpl-Ida
## Loading required package: svDialogs
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: tidytext
## 
## Loading required package: tm
## 
## Loading required package: NLP
## 
## 
## Attaching package: 'NLP'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## 
## Loading required package: igraph
## 
## 
## Attaching package: 'igraph'
## 
## 
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## 
## 
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     crossing
## 
## 
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## 
## 
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## 
## 
## The following object is masked from 'package:base':
## 
##     union
## 
## 
## Loading required package: ggraph
## 
## Loading required package: tidycensus
## 
## Loading required package: sf
## 
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
## 
## Loading required package: spdep
## 
## Loading required package: spData
## 
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## 
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## 
## Loading required package: remotes
## 
## Loading required package: devtools
## 
## Loading required package: usethis
## 
## 
## Attaching package: 'usethis'
## 
## 
## The following object is masked from 'package:remotes':
## 
##     git_credentials
## 
## 
## 
## Attaching package: 'devtools'
## 
## 
## The following objects are masked from 'package:remotes':
## 
##     dev_package_deps, install_bioc, install_bitbucket, install_cran,
##     install_deps, install_dev, install_git, install_github,
##     install_gitlab, install_local, install_svn, install_url,
##     install_version, update_packages
# install old version of rtweet

# install_version("rtweet", version = "0.7.0",
#                 repos = "https://cran.r-project.org")
library(rtweet)
## 
## Attaching package: 'rtweet'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

Load data from Census

Census API

First, you will need a Census API key. You can sign up for one here: https://api.census.gov/data/key_signup.html

census_api_key(dlgInput(
  "Enter a Census API Key",
  Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE,
install = TRUE
)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`
counties <- get_acs(
  "county",
  variables = "B01003_001", # total population variable
  year = 2021,
  output = "wide",
  geometry = TRUE,
  keep_geo_vars = TRUE
)
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Select and save counties of interest

States that overlap with the search area are selected by FIPS code.

# select states to include
states <- c(
    "54", "51", "50", "47", "45", "44", "42", "39", "37", "36", "05", "01",
    "34", "33", "29", "28", "25", "24", "23", "22", "21", "18", "17", "13",
    "12", "11", "10", "09", "48", "40", "20", "19", "26", "55", "27", "38",
    "46", "31", "35", "08")

counties <- filter(
  counties,
  STATEFP %in% states
) %>% 
  rename(POP = "B01003_001E", MOE = "B01003_001M") %>% 
  mutate(DENSITY = POP/ (ALAND/ (2.59*10^6))) # calculate density, people per square mile

#saveRDS(counties, here("data", "derived", "public", "counties.RDS"))

Load counties

Optionally, load counties from the counties.RDS file saved with the repository.

counties <- readRDS(here("data", "derived", "public", "counties.RDS"))

Search query geographic region

The search queries for the Hurricane Ida tweets were filtered by a geographic region with a 1,000-mile radius around (36, -87). Filter the counties data set to select only the counties that fall within this search area.

# set search query area
center <- data.frame(lon = -87, lat = 36)
center_sf <- st_as_sf(center, coords = c("lon","lat"), crs = 4269)
search_buff <- st_buffer(center_sf, 1609340) # 1000mi in meters

# select counties within search radius
options(warn = 0)
counties <- counties %>% 
  mutate(overlap = st_intersects(counties$geometry, search_buff),
         overlap = as.numeric(overlap)) %>% 
  filter(overlap == 1)

#saveRDS(counties, here("data", "derived", "public", "searched_counties.RDS"))

Load Counties

Optionally, load the counties from the searched_counties.RDS file saved with the repository.

counties <- readRDS(here("data", "derived", "public", "searched_counties.RDS"))

Visualize search region with hurricane track

# load hurricane track from storms dataset in dplyr
# storms data comes from NOAA HURDAT2 database
hurricane <- storms %>% 
  filter(year == 2021) %>% 
  filter(name == "Ida") %>% 
  mutate(date = make_date(year,month,day))
hurricane_sf <- st_as_sf(hurricane, coords = c("long","lat"), crs = 4269)

# plot query and hurricane track
ggplot() +
  geom_sf(data = counties, color = "white", fill = "gray", lwd = 0.05) +
  geom_sf(data = search_buff, color = "red", fill = NA) +
  geom_sf(data = hurricane_sf, aes(color = date)) +
  labs(title = "Twitter Search Query Region",
       color = "Hurricane Ida Track") +
  theme_minimal() + # apply theme_minimal() first so theme() tweaks are not overridden
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

#ggsave("search_viz.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

References for visualizing the storm track:

Load Data from Twitter

Note: Twitter (now X) no longer offers a free API for pulling large search queries of tweets. For this study, I used tweets that were retrieved from Twitter in September 2021 by Professor Holler using this code. The following code will no longer work but remains included for transparency of the data retrieval methods.

Set up Twitter API

You will need Twitter Developer API access to run this analysis: See https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html for instructions.

Reference for the search_tweets function: https://rtweet.info/reference/search_tweets.html

  • Don’t add any spaces between a parameter name and its value in your search, e.g. n=1000 is better than n = 1000.
  • The first parameter, in quotes, is the search string.
  • n=10000 asks for 10,000 tweets.
  • If you want more than 18,000 tweets, change retryonratelimit to TRUE and wait 15 minutes for every batch of 18,000.
  • include_rts=FALSE excludes retweets.
  • token refers to the Twitter token you defined above for access to your Twitter developer account.
  • geocode is a string with three parts: latitude, longitude, and radius, with the units mi for miles or km for kilometers.

This code block will ask you for your Twitter application name, key, and secret. Then it will launch a web browser and prompt you to log in to Twitter to authenticate the application.

Never save your API keys in code where they could be committed and synced to GitHub! The code below is configured to save your keys in the environment, and this Git repository is set up to ignore the R environment data file.

# Twitter application values
twitter_vars <- list(
  app = "enter Twitter application name",
  key = "enter Twitter API key",
  secret = "enter Twitter API secret key"
)

# if Twitter token has already been created, auto-fill dialogue with its values
if (exists("twitter_token")) {
  twitter_vars$app <- twitter_token$app$appname
  twitter_vars$key <- twitter_token$app$key
  twitter_vars$secret <- twitter_token$app$secret
}

twitter_token <- create_token(
  app = dlgInput("Twitter App Name:", twitter_vars$app)$res,
  consumer_key = dlgInput("Consumer Key:", twitter_vars$key)$res,
  consumer_secret = dlgInput("Consumer Secret:", twitter_vars$secret)$res,
  access_token = NULL,
  access_secret = NULL
)

Pre-processing

  • Acquire Twitter data for analysis
  • Filter Twitter data for good geographic information and convert to Lat/Long coordinates

Search for Hurricane event tweets

Get tweets for Hurricane Ida, searched on 02-Sept-2021

tevent_raw <- search_tweets("ida OR hurricane",
  n = 200000, include_rts = TRUE,
  token = twitter_token,
  geocode = "36,-87,1000mi",
  retryonratelimit = TRUE
)

# write status id's for results of the original twitter search
write.table(tevent_raw$status_id,
  here("data", "raw", "public", "teventids.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)

saveRDS(tevent_raw, here("data", "raw", "private", "tevent_raw.RDS") )

Second search for more temporal coverage

Get tweets for Hurricane Ida, searched on 05-Sept-2021

tevent_raw2 <- search_tweets("ida OR hurricane",
  n = 200000, include_rts = TRUE,
  token = twitter_token,
  geocode = "36,-87,1000mi",
  retryonratelimit = TRUE
)

# write status id's for results of the original twitter search
write.table(tevent_raw2$status_id,
  here("data", "raw", "public", "teventids2.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)

saveRDS(tevent_raw2, here("data", "raw", "private", "tevent_raw2.RDS") )

Third search for more temporal coverage

Get tweets for Hurricane Ida, searched on 10-Sept-2021

tevent_raw3 <- search_tweets("ida OR hurricane",
  n = 200000, include_rts = TRUE,
  token = twitter_token,
  geocode = "36,-87,1000mi",
  retryonratelimit = TRUE
)

# write status id's for results of the original twitter search
write.table(tevent_raw3$status_id,
  here("data", "raw", "public", "teventids3.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)

saveRDS(tevent_raw3, here("data", "raw", "private", "tevent_raw3.RDS") )

Fourth search for different thematic coverage

Get tweets for Hurricane Ida, searched on 10-Sept-2021

tevent_raw4 <- search_tweets("ida OR flood OR electricity OR recovery OR outage",
  n = 200000, include_rts = TRUE,
  token = twitter_token,
  geocode = "36,-87,1000mi",
  retryonratelimit = TRUE
)

# write status id's for results of the original twitter search
write.table(tevent_raw4$status_id,
  here("data", "raw", "public", "teventids4.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)

saveRDS(tevent_raw4, here("data", "raw", "private", "tevent_raw4.RDS") )

Search for generic tweets after hurricane season

Note: Instead of using generic tweets to normalize the hurricane tweets, this study uses Census population data. The following code shows how one would acquire generic tweets to normalize the hurricane tweets.

Get tweets without any text filter for the same geographic region, saved as tdcontrol. The query searches for all verified or unverified tweets, i.e. everything.

Warning: this code will no longer result in the same data! It is here for reference or replication work only.

tdcontrol_raw <- search_tweets("-filter:verified OR filter:verified",
  n = 200000, include_rts = FALSE,
  token = twitter_token,
  geocode = "32,-78,1000mi",
  retryonratelimit = TRUE
)

# write status id's for results of the original twitter search
write.table(tdcontrol_raw$status_id,
  here("data", "raw", "public", "tdcontrolids.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)

For the original Hurricane Dorian study, generic tweets were searched for after the hurricane on November 19, 2019. I tried using this data to normalize the Hurricane Ida tweets in addition to the Census data to see if the results are similar to the Census results (i.e., whether the number of tweets from a random day is a viable way to normalize the hurricane tweets in the absence of new Twitter data).

november <- readRDS(here("data", "derived", "private", "november.RDS"))

Rehydrate the twitter data

Twitter does not permit redistribution of Twitter data, with the exception of tweet status ids. For the purposes of transparency and reproducibility, researchers may include a list of status id’s with their publication. The process of using those status ids to query Twitter for the full tweet data is called rehydrating—like going back to the analogous fire hose of big data.

Warning: Twitter users and individual tweets can be deleted over time, therefore the results of rehydration will be similar, but not identical to, the original Twitter data used for this research.


Warning: It may take more than an hour to rehydrate the raw tweets with Twitter queries, so you may choose to load only the derived status ids, which are filtered to only the tweets with valid geographic data (approximately one tenth of the raw tweets).

Load Twitter status ids

# load tweet status id's for Hurricane event search results

filtered <- dlgList(
  choices = c("raw", "derived"), 
  title = "Which event ids?")$res

teventids <-
  data.frame(read.table(here("data", filtered, "public", "teventids.txt"),
    numerals = "no.loss"
  ))

filtered <- dlgList(
  choices = c("raw", "derived"), 
  title = "Which control ids?")$res

# load status id's for the tdcontrol generic twitter search
tdcontrolids <-
  data.frame(read.table(here("data", filtered, "public", "tdcontrolids.txt"),
    numerals = "no.loss"
  ))

Rehydrate Twitter status ids

This operation may take over an hour to run on all of the raw tweets.

# rehydrate event tweets (requires the rehydratoR package)
library(rehydratoR)
tevent_raw <- rehydratoR(twitter_token$app$key, twitter_token$app$secret,
  twitter_token$credentials$oauth_token,
  twitter_token$credentials$oauth_secret, teventids,
  base_path = NULL, group_start = 1
)

# rehydrate tdcontrol tweets
tdcontrol_raw <- rehydratoR(twitter_token$app$key, twitter_token$app$secret,
  twitter_token$credentials$oauth_token,
  twitter_token$credentials$oauth_secret, tdcontrolids,
  base_path = NULL, group_start = 1
)

Load the original search results

tevent_raw <- readRDS(here("data", "raw", "private", "tevent_raw.RDS"))
tevent_raw2 <- readRDS(here("data", "raw", "private", "tevent_raw2.RDS"))
tevent_raw3 <- readRDS(here("data", "raw", "private", "tevent_raw3.RDS"))
tevent_raw4 <- readRDS(here("data", "raw", "private", "tevent_raw4.RDS"))

Process geographic data in tweets

Combine searches

tevent_raw <- rbind(tevent_raw, tevent_raw2, tevent_raw3, tevent_raw4) %>% 
  distinct(status_id, .keep_all=TRUE) # remove duplicates
rm(tevent_raw2, tevent_raw3, tevent_raw4)
count(tevent_raw, place_type)
## # A tibble: 6 × 2
##   place_type        n
##   <chr>         <int>
## 1 admin          3541
## 2 city          15454
## 3 country          25
## 4 neighborhood     41
## 5 poi             477
## 6 <NA>         355805

Convert geographic information into lat/long coordinates

If you have loaded filtered status ids, or you have already run this code, you will not notice a difference in place_type or n because the data has already been processed.

  1. Convert coords_coords (from location services) to lat,lng coordinates.
  2. Filter only the records with location services-based lat and long, or a place designated to the city, neighborhood, or POI level.
  3. Convert the place to lat,lng. By default, this does not overwrite lat and long.

Caution: Do not use geo_coords or lat/lng will be inverted!

# convert geographic information for event into lat,lng coordinates
tevent <- tevent_raw %>% 
  lat_lng(coords = c("coords_coords")) %>% 
  subset(place_type == "city" | place_type == "neighborhood" | 
    place_type == "poi" | !is.na(lat)
  ) %>% 
  lat_lng(coords = c("bbox_coords"))
  
# re-check counts of place types
count(tevent, place_type)
## # A tibble: 6 × 2
##   place_type       n
##   <chr>        <int>
## 1 admin          497
## 2 city         15454
## 3 country          1
## 4 neighborhood    41
## 5 poi            477
## 6 <NA>            10

Convert geographic information to lat,lng for control tweets.

Warning: this code could only be run if a generic tweet search query were available. That data was not available for this iteration of the study.

# convert geographic information for control into lat,lng coordinates
tdcontrol <- tdcontrol_raw %>% 
  lat_lng(coords = c("coords_coords")) %>% 
  subset(place_type == "city" | place_type == "neighborhood" | 
    place_type == "poi" | !is.na(lat)
  ) %>% 
  lat_lng(coords = c("bbox_coords"))

Save processed tweets

Optionally, save the tweet ids to the data/derived/public folder as plain text.
Save the full tweet data to the data/derived/private folder as RDS files.
Full tweet data cannot be shared with the public, so it is stored in a folder ignored by Git.

# save event data
write.table(tevent$status_id,
  here("data", "derived", "public", "teventids.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent, here("data", "derived", "private", "tevent_all.RDS"))

# save control data
write.table(tdcontrol$status_id,
  here("data", "derived", "public", "tdcontrolids.txt"),
  append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tdcontrol, here("data", "derived", "private", "tdcontrol.RDS"))

Load processed tweets

Optionally, load processed twitter data

tevent <- readRDS(here("data", "derived", "private", "tevent_all.RDS"))

Temporal Analysis

Graph the number of tweets about Hurricane Ida per hour over the course of the hurricane.

tevent_tweets_by_hour <- ts_data(tevent, by = "hours")

ggplot()+
  geom_line(data=tevent_tweets_by_hour, aes(x=time,y=n), color="black") +
  xlab("Time") +
  ylab("Number of tweets") +
  labs(title="Hurricane Ida Tweets per Hour (2021)") +
  theme_minimal()

# ggsave("tweets_by_hour.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

The first spike occurs on August 29, 2021, when the hurricane made landfall in Louisiana. The second-largest spike occurs on September 3, 2021, as the hurricane petered out off the North Atlantic coast of the US.
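
To tie these spikes to the storm timeline visually, the landfall date could be marked on the plot. A minimal sketch, assuming tevent_tweets_by_hour from the chunk above; the landfall timestamp is an approximation used only for illustration.

# mark the approximate Louisiana landfall on the hourly tweet series
# (timestamp is approximate and used for illustration only)
landfall <- as.POSIXct("2021-08-29 17:00:00", tz = "UTC")

ggplot(tevent_tweets_by_hour, aes(x = time, y = n)) +
  geom_line(color = "black") +
  geom_vline(xintercept = landfall, color = "red", linetype = "dashed") +
  labs(x = "Time", y = "Number of tweets",
       title = "Hurricane Ida Tweets per Hour (2021)") +
  theme_minimal()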

Network Analysis

Visualize network of retweets between Twitter users who tweeted about Hurricane Ida.

tevent_network <- tevent %>% 
  network_graph("retweet,quote") %>% 
  simplify(remove.multiple = FALSE)

tevent_network <- delete.vertices(
  tevent_network,
  degree(tevent_network, mode="in") < 1
  ) %>% 
  simplify()

tevent_network <- delete.vertices(
  tevent_network,
  degree(tevent_network) == 0
  )

plot.igraph(
    tevent_network,
    vertex.size = 4,
    vertex.label = ifelse(
      degree(tevent_network) > 1, 
      V(tevent_network)$name, ""), 
    vertex.label.cex = 0.8, 
    edge.arrow.mode = "->",
    edge.arrow.size = 0.2
  )

Most of these Twitter users are news anchors, news outlets, or meteorologists. The graph shows who has retweeted another user more than once.

Text Analysis

Clean the text data

Parse the tweet data for plain language, and parse tweet text into words. Remove stop words and Hurricane Ida search terms.

# remove urls, fancy formatting, etc. in other words, clean the text content
tevent_text <- tevent %>%
  select(text) %>%
  plain_tweets()

# parse out words from tweet text
tevent_words <- tevent_text %>% unnest_tokens(word, text)

# how many words do you have including the stop words?
word_count <- list(before = count(tevent_words)$n)

# create list of stop words (useless words not worth analyzing)
data("stop_words")

# add "t.co" twitter links to the list of stop words
# also add the twitter search terms to the list
# it would have been better to store a list of search terms to use here
stop_words <- stop_words %>%
  add_row(word = "t.co", lexicon = "SMART") %>%
  add_row(word = "hurricane", lexicon = "Search") %>%
  add_row(word = "ida", lexicon = "Search") %>%
  add_row(word = "sharpiegate", lexicon = "Search") %>%
  add_row(word = "hurricaneida", lexicon = "Search")

# delete stop words from tevent_words with an anti_join
tevent_words <- anti_join(tevent_words, stop_words, by="word")

# how many words after removing the stop words?
word_count <- append(
  word_count,
  list(after = count(tevent_words)$n)
  )
print(word_count)
## $before
## [1] 398757
## 
## $after
## [1] 190154

Graph frequencies of words

tevent_words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() + # flip so the words read horizontally
  labs(
    x = NULL,      # flipped vertical axis lists the words
    y = "Count",   # flipped horizontal axis is the word count
    title = "Count of unique words found in tweets"
  ) +
  theme_minimal()

Analyze and graph word association

# separate words and count frequency of word pair occurrence in tweets
tevent_word_pairs <- tevent_text %>%
  mutate(text = removeWords(tolower(text), stop_words$word)) %>%
  unnest_tokens(paired_words, text, token = "ngrams", n = 2) %>%
  separate(paired_words, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

# graph a word cloud with space indicating association
tevent_word_pairs %>%
  filter(n >= 30 & !is.na(word1) & !is.na(word2)) %>% # choose how many instances a word has to be included on graph
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3, color = "darkorange3") +
  labs(
    title = "Word Network of Tweets during Hurricane Ida",
    x = "", y = ""
  ) +
  theme(
    plot.background = element_rect(
    fill = "grey95",
    colour = "black",
    size = 1
    ),
    legend.background = element_rect(fill = "grey95")
  )

Two main clusters emerge: one about utilities and the other about storm behavior. The utilities cluster has closely related words like power, outage, grid, and event. The storm cluster includes words like flooding, waters, insurance, and warning. A smaller cluster links the words “louisiana”, “orleans”, “makes”, and “landfall”. The smallest clusters of word pairs are often words used together in a phrase, like “climate change”, “President Biden”, or “Grand Isle”.

Spatial Analysis

Map Population Density and Tweet Points

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_number(DENSITY, 5)), color = "grey") +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Population Density")) +
  geom_point(
    data = tevent, aes(x = lng, y = lat),
    colour = "purple", alpha = 0.1, size = 1
  ) +
  geom_sf(data = search_buff, color = "red", fill = NA) +
  labs(title = "Tweet Locations During Hurricane Ida") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

  # + xlim(-110, -67) +
  # ylim(24, 47)

# ggsave("tweet_locations.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

Plotting the tweets reveals that the results do not fall neatly within the specified search radius. Twitter’s search methods are proprietary, so we cannot get more detail about how the search radius is implemented.
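
To put a rough number on this mismatch, one could compute the share of geocoded tweets that fall outside the nominal 1,000-mile buffer. A quick sketch, assuming tevent and search_buff are loaded as above:

# proportion of geocoded tweets outside the advertised search radius
tweet_pts <- tevent %>%
  filter(!is.na(lng), !is.na(lat)) %>% # keep only tweets with coordinates
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(st_crs(search_buff))

outside <- !st_intersects(tweet_pts, search_buff, sparse = FALSE)[, 1]
mean(outside)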

Join Tweets to Counties

Spatially join tevent tweets to counties with the following steps:

  1. Make point geometries from lat and lng columns
  2. Reproject points to NAD 1983 geographic coordinate system
  3. Join county GEOID to each tweet by location
  4. Drop geometry of points
  5. Group by county GEOID
  6. Count number of tweets by county
  7. Join number of tweets to counties
  8. Replace missing values with 0
  9. Calculate rate of spatial tweets per 10,000 people

tevent_sf <- tevent %>%
  # optional temporal filter:
  # filter(created_at > as.POSIXct("2021-08-30 00:00:00")) %>% 
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>% # make point geometries
  st_transform(4269) %>% # transform to NAD 1983
  st_join(select(counties, GEOID)) # spatially join county GEOID to each tweet

tevent_by_county <- tevent_sf %>%
  st_drop_geometry() %>% # drop geometry / make simple table
  group_by(GEOID) %>% # group by county using GEOID
  summarise(event_tweets = n()) # count # of tweets

counties <- counties %>%
  left_join(tevent_by_county, by = "GEOID") %>%
  mutate(event_tweets = replace_na(event_tweets, 0),
         twitter_pop = POP/10000,
         tweetrate = event_tweets / twitter_pop)

Calculate the normalized difference tweet index: (event - control) / (event + control), where the control here is population / 10,000.

counties <- counties %>%
  mutate(ndti = (event_tweets - twitter_pop)/(event_tweets + twitter_pop)) %>% 
  mutate(ndti = replace_na(ndti, 0)) # replace NULLs with 0's

Save counties as counties_tweet_counts.RDS

# save counties geographic data with derived tweet rates
saveRDS(counties, here("data", "derived", "public", "counties_tweet_counts.RDS"))

Optionally, begin here by loading counties with Twitter data from counties_tweet_counts.RDS

counties <- readRDS(here("data", "derived", "public", "counties_tweet_counts.RDS"))

NDTI

Plot NDTI using population / 10000

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_width(ndti, width = 0.4)), 
    color = "grey", lwd = 0.01) +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Normalized Difference Tweet Index")) +
  labs(title = "Hurricane Ida Twitter Activity",
       subtitle = "Using population / 10,000 as control") +
  # xlim(-105, -67) +
  # ylim(24, 47) +
  theme_minimal() + # apply theme_minimal() first so theme() tweaks are not overridden
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

# ggsave("ndti_census.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

Plot NDTI using November 2019 baseline

For the original Hurricane Dorian study, generic tweets were queried in November 2019 and used to normalize the event tweets. I use this dataset to calculate the NDTI to see how different it is from the Census population results.

# make november data sf compatible
nov_by_county <- november %>%
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(4269) %>%
  st_join(select(counties, GEOID)) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(nov = n())

# join november data to counties dataset
counties <- counties %>%
  left_join(nov_by_county, by = "GEOID")

# calculate NDTI 
counties <- counties %>%
  mutate(ndti_nov = (event_tweets - nov) / (event_tweets + nov)) %>% # normalized diff tweet index
  mutate(ndti_nov = replace_na(ndti_nov, 0))  # replace NULLs with 0's

ggplot() +
  geom_sf(data = counties, 
    aes(fill = cut_width(ndti_nov, width = 0.4)), 
        color = "grey", lwd = 0.01) +
  scale_fill_brewer(palette = "GnBu") +
  guides(fill = guide_legend(title = "Normalized Difference Tweet Index")) +
  labs(title = "Hurricane Ida Twitter Activity",
       subtitle = "Using tweets in November 2019 as control") +
  xlim(-105, -67) +
  ylim(24, 47) +
  theme_minimal() + # apply theme_minimal() first so theme() tweaks are not overridden
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

# ggsave("ndti_novtweets.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

The NDTI normalized by the actual November 2019 Twitter data has many more counties with an NDTI around 0. This suggests that these counties had low Twitter activity at this time and, compared to the Census normalization, had fewer tweets than population / 10,000. Clustered areas of highest NDTI are southern Louisiana and Mississippi and the northeast megalopolis.
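
To see this difference between the two normalizations directly, the county-level distributions could be compared side by side. A minimal sketch, assuming counties now holds both the ndti and ndti_nov columns calculated above:

# compare the distributions of the two NDTI versions
counties %>%
  st_drop_geometry() %>%
  select(ndti, ndti_nov) %>%
  pivot_longer(everything(), names_to = "version", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ version) +
  labs(x = "NDTI", y = "Number of counties") +
  theme_minimal()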

Spatial Cluster Analysis

Create Spatial Weight Matrix

Use a 110 km distance threshold (geodesic, given longitude/latitude coordinates) and include each county in its own neighborhood in the weight matrix.

county_coords <- counties %>%
  st_centroid() %>% # convert polygons to centroid points
  st_coordinates() # convert to simple x,y coordinates to use with spdep

thresdist <- county_coords %>% 
  dnearneigh(0, 110, longlat = TRUE) %>% # use geodesic distance of 110km
  # distance should be long enough for every feature to have >= one neighbor
  include.self() # include a county in its own neighborhood (for G*)

thresdist # view statistical summary of the nearest neighbors
## Neighbour list object:
## Number of regions: 2683 
## Number of nonzero links: 66743 
## Percentage nonzero weights: 0.9271807 
## Average number of links: 24.87626

Optionally, plot the spatial weight matrix results. This should produce a very dense graph because each county is connected to all other counties within 110 km.

plot(thresdist, county_coords, lwd=0.1) # plot nearest neighbor ties

Calculate Getis-Ord G* Statistic

# Create weight matrix from the neighbor objects
dwm <- nb2listw(thresdist, zero.policy = T)

# Getis-Ord G* statistic for hot and cold spots
counties$locG <- counties$tweetrate %>% 
  localG(listw = dwm, zero.policy = TRUE) %>% 
  as.vector()

# check summary statistics of the local G score
summary(counties$locG)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.20949 -0.70955 -0.49303 -0.02718 -0.14245 21.53366

Map Hotspots

Classify G* scores by significance values typical of z-scores, where 1.15 corresponds to the 0.125 significance level and 1.95 to the 0.05 level for two-tailed z-scores, based on Getis and Ord (1992), doi: 10.1111/j.1538-4632.1992.tb00261.x.

To find other critical values, use the qnorm() function as shown here: https://methodenlehre.github.io/SGSCLM-R-course/statistical-distributions.html

Breaks and colors from http://michaelminn.net/tutorials/r-point-analysis/ based on 1.96 as the 95% confidence interval for z-scores.
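
For reference, these critical values can be reproduced with qnorm(); for example:

# critical z-values computed with qnorm()
qnorm(1 - 0.025) # ~1.96, the 0.05 two-tailed significance level
qnorm(1 - 0.125) # ~1.15, i.e. 0.125 in one tail (0.25 two-tailed)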

# classify by significance levels
siglevel <- c(1.15, 1.95)
counties <- counties %>%
  mutate(sig = cut(locG, c(
    min(counties$locG),
    siglevel[2] * -1,
    siglevel[1] * -1,
    siglevel[1],
    siglevel[2],
    max(counties$locG)
  )))

# map results
ggplot() +
  geom_sf(data = counties, aes(fill = sig), color = "white", lwd = 0.1) +
  scale_fill_manual(
    values = c("#0000FF80", "#8080FF80", "#FFFFFF80", "#FF808080", "#FF000080"),
    labels = c("low", "", "insignificant", "", "high"),
    aesthetics = "fill"
  ) +
  guides(fill = guide_legend(title = "Hot Spots")) +
  labs(title = "Clusters of Hurricane Ida Twitter Activity") +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )

# ggsave("tweet_hotspots.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")

Results

Temporal Analysis

The temporal analysis of tweets per hour over the course of the hurricane revealed the highest activity occurred on August 29, 2021. This coincides with the hurricane’s peak intensity as it made landfall near Grand Isle, Louisiana. The second highest tweet frequency occurred on September 3, shortly before the hurricane dissipated in the North Atlantic.

Network Analysis

The network analysis of retweets of Hurricane Ida tweets revealed that news anchors and analysts (@Elisa_Raffa, @DrRobertCollins), news outlets (the National Hurricane Center @NHC_Atlantic, the Louisiana CBS channel WWL-TV @WWLTV), and meteorologists and weather enthusiasts (@skydrama, @StormTrackBrian) were the most frequently retweeted.

Text Analysis

Two main clusters emerge: one about utilities and the other about storm behavior. The utilities cluster has closely related words like power, outage, grid, and event. The storm cluster includes words like flooding, waters, insurance, and warning. A smaller cluster links the words “louisiana”, “orleans”, “makes”, and “landfall”. The smallest clusters of word pairs are often words used together in a phrase, like “climate change”, “President Biden”, and “Grand Isle”. The heavy use of Louisiana-related words (Grand Isle, New Orleans) may reflect the period of highest Twitter activity, when the hurricane passed through Louisiana.

Spatial Analysis

Both NDTI calculations revealed high Twitter activity along the Louisiana and Mississippi coast and in the greater New York metropolitan area. The main difference between the two normalization methods is the most frequent value among counties outside the high-NDTI areas. Using the Census population NDTI, most counties have an NDTI of -1 to -0.6, whereas with the November 2019 generic-tweet NDTI, most counties fall between -0.2 and 0.2.

Four major hotspots emerged on the map generated by Getis-Ord G*: Louisiana and Mississippi, the New York metropolitan area, Iowa, and western Kansas.

Discussion

Twitter Data

The plot of Hurricane Ida tweet locations reveals that many tweets in the data set are outside of the specified radius for the search query. This visualization points to the uncertainty of the Twitter data quality. Twitter’s methods are proprietary, so we cannot determine how this geographic search is conducted. Very few of the tweets had exact coordinates, so most of the coordinates associated with tweets in the data are the centroid of a city or neighborhood. It is also unclear how accurate the tweet locations Twitter provides are. This is all to say that, before even discussing the spatial analysis conducted in this study, the geographic data contains a lot of uncertainty.

NDTI and Hotspots

In the absence of a generic Twitter search query to normalize the hurricane-related tweets, I used the county population divided by 10,000. Comparing this NDTI to the November 2019 tweet normalization and the hotspot maps, all three maps identify clusters of high Twitter activity about Hurricane Ida in Louisiana and the Mississippi coast and in the New York metropolitan area. Both of these areas were heavily impacted by the hurricane. The hurricane made landfall on the coast of Louisiana, and New York City experienced flooding due to rain that broke the city’s record for most single-hour rainfall.

The November 2019 search had a different geographic area than the Hurricane Ida tweet searches, so there are no tweets for counties in the westernmost part of the map. While the exact NDTI values are not valid, plotting this map shows that population / 10,000 can be used in place of a generic tweet baseline and still yield similar results.

While developing this study, I used different spatial extents for the counties (such as including all counties of states that intersected with the search area, so counties outside the alleged search radius were included). Changing the spatial extent changed the results of the Getis-Ord G* hotspot map. When I included more counties with no or low Hurricane Ida Twitter activity (i.e., those on the western edge of the search radius, furthest from the storm), the lowest-activity hotspots dropped off the map. This is because the added counties lower the average hurricane-related activity, so there are more low-activity areas and the lowest-activity cluster becomes less salient. For this study, I was most interested in areas of highest Twitter activity, and those hotspots did not change with changes in spatial extent. Areas around the edge of the study area are also less connected in the Getis-Ord G* spatial weight matrix because they are surrounded by fewer counties. This did not seem to affect the study results because the edges of the study area were over the ocean or far from the storm track (the main area of interest).

Future Additions

There are many additions that could enhance the geographic validity of this study. The following are changes I would have liked to implement with more time:

  • “Stem” words: The word frequency analysis revealed the related words “flood” and “flooding”. These words have essentially the same meaning, so “stemming” the words in the data set could yield better insight into the topics of tweets; the quanteda text analysis package has a word-stemming function (see the sketch after this list). Additionally, removing numbers and times (“1”, “4”, and “pm” made the top word list) could reveal more about tweet word usage.

  • Visualizing hotspot neighborhoods: Visualize the Getis-Ord G* neighborhood around the highest and lowest activity clusters on the hotspot map.

  • Population by age: Using population / 10,000 may not be the best way to estimate the tweet rate because the age of Twitter users is not evenly distributed. US Twitter users are predominantly under the age of 50, so a subset of county population by age (available in the Census 5-year ACS used in this study) could be used instead.
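
A minimal sketch of the stemming idea, applying the SnowballC stemmer (which quanteda’s word-stemming functions also rely on) to the tevent_words table from the text analysis above; SnowballC would need to be installed first:

# stem words so that e.g. "flood" and "flooding" are counted together
library(SnowballC)

tevent_words %>%
  mutate(stem = wordStem(word)) %>%
  count(stem, sort = TRUE) %>%
  slice_head(n = 15)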

Conclusion

I successfully replicated the Holler (2019) study of Hurricane Dorian Twitter activity for Hurricane Ida using the code from the original study. Changes were made to the original code to use population to calculate tweet rate and improve visualizations.

The original study of Hurricane Dorian was motivated by the Sharpiegate controversy in which President Trump mistakenly insisted on Twitter that Hurricane Dorian would hit Alabama, and inter-agency chaos ensued. It is difficult to assess the validity of information from social media such as Twitter, especially when authority figures are spreading the information.

The implications of this replication study are difficult to assert due to the lack of a free Twitter API, which would be required to replicate this study with different search terms or a different search area. Analysis of Twitter activity during storms can reveal how people adapt their collective response to ever-changing storm behavior using technology.

From an open-science perspective, the changes to Twitter’s API availability decidedly run counter to open-science practice by limiting what data can be accessed and who can access it. Despite implementing open-science practices in this study design, this study cannot be fully open because of its data source. This exemplifies the challenge of implementing open science when many of the resources employed in science are not yet openly available.

Acknowledgements

I collaborated with Andy Cao on this project. See his project on his GitHub site for a different take on this replication.

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI: 10.17605/OSF.IO/W29MQ.

References