Author: Joseph Holler, Elise Chan
Created: Fall 2019
Updated: Fall 2023
This analysis is based on a previous study of Hurricane Dorian (https://www.github.com/gis4dev/OR-Dorian). The original analysis was developed with assistance from:
Casey Lilley’s GEOG 323 final project available at: https://caseylilley.github.io/finalproj.html
Leah Wasser and Carson Farmer’s Twitter Data in R Using RTweet tutorial on EarthLab at: https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/use-twitter-api-r/
Michael Minn’s Basic Spatial Point Pattern Analysis in R tutorial available at: http://michaelminn.net/tutorials/r-point-analysis/
This study is a replication of Professor Holler’s study of Twitter activity during Hurricane Dorian in 2019, instead analyzing Twitter activity during Hurricane Ida in 2021. To assess Twitter activity, the study uses tweets containing keywords related to the hurricane during this time period in the eastern United States. Tweets are analyzed for temporal distribution, networks, text content, and spatial distribution. Holler (2019) developed a Normalized Difference Tweet Index (NDTI) to measure the relative abundance of tweets about a hurricane event in each county, normalized by the number of non-hurricane tweets in that county. Because changes in Twitter’s API availability prevented new Twitter data from being collected for this study, this replication uses county population to calculate the tweet rate for the NDTI. To determine hotspots of Twitter activity, the study uses the Getis Ord G* method. Additional deviations from Holler (2019) include visualizing the geographic area of the search queries, using all counties that fall within the search area, and improving visualizations of the analyses.
Load the R project saved in the root directory of this repository, so that the working directory is the root directory of the repository.
# list of required packages
packages <- c(
"here", "svDialogs", "tidyverse",
"tidytext", "tm", "igraph", "ggraph",
"tidycensus", "sf", "spdep"
)
if(!require(groundhog)){
install.packages("groundhog")
require(groundhog)
}
if(!require(here)){
install.packages("here")
require(here)
}
# load package versions as of 2021-09-10 with groundhog
groundhog.day <- "2021-09-10"
set.groundhog.folder(here("data", "scratch", "groundhog"))
groundhog.library(packages, groundhog.day)
# load and install required packages
package_check <- lapply(
packages,
FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE, quietly = TRUE)
library(x, character.only = TRUE)
}
}
)
# Load the 'remotes' package
#install.packages("remotes")
library(remotes)
# install a previous version of rtweet
# in 2021, would have been version 0.7.0.
# detach("package:rtweet", unload = TRUE) # run this line if a current version of rtweet is already installed
library(devtools)
install_version("rtweet", version = "0.7.0",
repos = "https://cran.r-project.org")
library(rtweet)
# save the R processing environment
# writeLines(
# capture.output(sessionInfo()),
# here("procedure", "environment", "r_environment.txt")
# )
# load and install required packages
packages <- c(
  "here", "svDialogs", "tidyverse",
  "tidytext", "tm", "igraph", "ggraph",
  "tidycensus", "sf", "spdep",
  "remotes", "devtools"
)
package_check <- lapply(
packages,
FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE, quietly = TRUE)
library(x, character.only = TRUE)
}
}
)
# install old version of rtweet
# install_version("rtweet", version = "0.7.0",
# repos = "https://cran.r-project.org")
library(rtweet)
First, you will need a Census API key. You can sign up for one here: https://api.census.gov/data/key_signup.html
census_api_key(dlgInput(
"Enter a Census API Key",
Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE,
install = TRUE
)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").
## To use now, restart R or run `readRenviron("~/.Renviron")`
counties <- get_acs(
"county",
variables = "B01003_001", # total population variable
year = 2021,
output = "wide",
geometry = TRUE,
keep_geo_vars = TRUE
)
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
States that overlap with the search area are selected by FIPS code.
# select states to include
states <- c(
"54", "51", "50", "47", "45", "44", "42", "39", "37", "36", "05", "01",
"34", "33", "29", "28", "25", "24", "23", "22", "21", "18", "17", "13",
"12", "11", "10", "09", "48", "40", "20", "19", "26", "55", "27", "38",
"46", "31", "35", "08")
counties <- filter(
counties,
STATEFP %in% states
) %>%
rename(POP = "B01003_001E", MOE = "B01003_001M") %>%
mutate(DENSITY = POP/ (ALAND/ (2.59*10^6))) # calculate density, people per square mile
#saveRDS(counties, here("data", "derived", "public", "counties.RDS"))
Optionally, load counties from the counties.RDS file saved with the repository.
counties <- readRDS(here("data", "derived", "public", "counties.RDS"))
The search queries for the Hurricane Ida tweets were filtered to a geographic region defined by a 1,000-mile radius around latitude 36, longitude -87. Filter the counties data set to select only the counties that fall within this search area.
# set search query area
center <- data.frame(lon = -87, lat = 36)
center_sf <- st_as_sf(center, coords = c("lon","lat"), crs = 4269)
search_buff <- st_buffer(center_sf, 1609340) # 1000mi in meters
# select counties within search radius
options(warn = 0)
counties <- counties %>%
mutate(overlap = st_intersects(counties$geometry, search_buff),
overlap = as.numeric(overlap)) %>%
filter(overlap == 1)
#saveRDS(counties, here("data", "derived", "public", "searched_counties.RDS"))
Optionally, load the counties from the searched_counties.RDS file saved with the repository.
counties <- readRDS(here("data", "derived", "public", "searched_counties.RDS"))
# load hurricane track from storms dataset in dplyr
# storms data comes from NOAA HURDAT2 database
hurricane <- storms %>%
filter(year == 2021) %>%
filter(name == "Ida") %>%
mutate(date = make_date(year,month,day))
hurricane_sf <- st_as_sf(hurricane, coords = c("long","lat"), crs = 4269)
# plot query and hurricane track
ggplot() +
geom_sf(data = counties, color = "white", fill="gray", lwd = 0.05) +
geom_sf(data = search_buff, color = "red", fill = NA) +
geom_sf(data = hurricane_sf, aes(color=date)) +
labs(title = "Twitter Search Query Region",
color = "Hurricane Ida Track") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )
#ggsave("search_viz.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
References for visualizing the storm track:
Note: Twitter (now X) no longer offers a free API for pulling large search queries of tweets. For this study, I used tweets that were retrieved from Twitter in September 2021 by Professor Holler using this code. The following code will no longer work but remains included for transparency of the data retrieval methods.
You will need Twitter Developer API access to run this analysis: See https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html for instructions.
Reference for the search_tweets function: https://rtweet.info/reference/search_tweets.html
- Don’t add any spaces between a parameter name and its value in your search, e.g. n=1000 is better than n = 1000.
- The first parameter, in quotes, is the search string.
- n=10000 asks for 10,000 tweets.
- If you want more than 18,000 tweets, change retryonratelimit to TRUE and wait 15 minutes for every batch of 18,000.
- include_rts=FALSE excludes retweets.
- token refers to the Twitter token you defined above for access to your Twitter developer account.
- geocode is a string with three parts: latitude, longitude, and distance, with the units mi for miles or km for kilometers (e.g. "36,-87,1000mi").
This code block will ask you for your twitter application name, key, and secret. Then it will launch a web browser and prompt you to log in to Twitter to authenticate the application.
Never save your API keys in code where they can be committed and synced to GitHub! The code below is configured to save your keys in the environment, and this Git repository is set up to ignore the R environment data file.
# Twitter application values
twitter_vars <- list(
app = "enter Twitter application name",
key = "enter Twitter API key",
secret = "enter Twitter API secret key"
)
# if Twitter token has already been created, auto-fill dialogue with its values
if (exists("twitter_token")) {
twitter_vars$app <- twitter_token$app$appname
twitter_vars$key <- twitter_token$app$key
twitter_vars$secret <- twitter_token$app$secret
}
twitter_token <- create_token(
app = dlgInput("Twitter App Name:", twitter_vars$app)$res,
consumer_key = dlgInput("Consumer Key:", twitter_vars$key)$res,
consumer_secret = dlgInput("Consumer Secret:", twitter_vars$secret)$res,
access_token = NULL,
access_secret = NULL
)
Get tweets for Hurricane Ida, searched on 02-Sept-2021.
tevent_raw <- search_tweets("ida OR hurricane",
n = 200000, include_rts = TRUE,
token = twitter_token,
geocode = "36,-87,1000mi",
retryonratelimit = TRUE
)
# write status id's for results of the original twitter search
write.table(tevent_raw$status_id,
here("data", "raw", "public", "teventids.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent_raw, here("data", "raw", "private", "tevent_raw.RDS") )
Get tweets for Hurricane Ida, searched on 05-Sept-2021.
tevent_raw2 <- search_tweets("ida OR hurricane",
n = 200000, include_rts = TRUE,
token = twitter_token,
geocode = "36,-87,1000mi",
retryonratelimit = TRUE
)
# write status id's for results of the original twitter search
write.table(tevent_raw2$status_id,
here("data", "raw", "public", "teventids2.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent_raw2, here("data", "raw", "private", "tevent_raw2.RDS") )
Get tweets for Hurricane Ida, searched on 10-Sept-2021.
tevent_raw3 <- search_tweets("ida OR hurricane",
n = 200000, include_rts = TRUE,
token = twitter_token,
geocode = "36,-87,1000mi",
retryonratelimit = TRUE
)
# write status id's for results of the original twitter search
write.table(tevent_raw3$status_id,
here("data", "raw", "public", "teventids3.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent_raw3, here("data", "raw", "private", "tevent_raw3.RDS") )
Get tweets for Hurricane Ida with additional keywords (flood, electricity, recovery, outage), searched on 10-Sept-2021.
tevent_raw4 <- search_tweets("ida OR flood OR electricity OR recovery OR outage",
n = 200000, include_rts = TRUE,
token = twitter_token,
geocode = "36,-87,1000mi",
retryonratelimit = TRUE
)
# write status id's for results of the original twitter search
write.table(tevent_raw4$status_id,
here("data", "raw", "public", "teventids4.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent_raw4, here("data", "raw", "private", "tevent_raw4.RDS") )
Note: Instead of using generic tweets to normalize the hurricane tweets, this study uses Census population data. The following code shows how one would acquire generic tweets to normalize the hurricane tweets.
Get tweets without any text filter for the same geographic region in tdcontrol. The query searches for all tweets, verified or unverified, i.e. everything.
Warning: this code will no longer result in the same data! It is here for reference or replication work only.
tdcontrol_raw <- search_tweets("-filter:verified OR filter:verified",
n = 200000, include_rts = FALSE,
token = twitter_token,
geocode = "32,-78,1000mi",
retryonratelimit = TRUE
)
# write status id's for results of the original twitter search
write.table(tdcontrol_raw$status_id,
here("data", "raw", "public", "tdcontrolids.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
For the original Hurricane Dorian study, generic tweets were searched for after the hurricane, on November 19, 2019. I tried using this data to normalize the Hurricane Ida tweets in addition to the Census data to see whether the results are similar to the Census results (i.e. whether the number of tweets from a random day is a viable way to normalize the hurricane tweets in the absence of new Twitter data).
november <- readRDS(here("data", "derived", "private", "november.RDS"))
Twitter does not permit redistribution of Twitter data, with the exception of tweet status ids. For the purposes of transparency and reproducibility, researchers may include a list of status ids with their publication. The process of using those status ids to query Twitter for the full tweet data is called rehydrating, an allusion to returning to the proverbial fire hose of big data.
Warning: Twitter users and individual tweets can be deleted over time, therefore the results of rehydration will be similar, but not identical to, the original Twitter data used for this research.
Warning: It may take more than an hour to rehydrate the raw tweets with Twitter queries, so you may choose to load only the derived status ids, which have been filtered to only the tweets with valid geographic data (approximately one tenth of the raw tweets).
# load tweet status id's for Hurricane event search results
filtered <- dlgList(
choices = c("raw", "derived"),
title = "Which Dorin ids?")$res
teventids <-
data.frame(read.table(here("data", filtered, "public", "teventids.txt"),
numerals = "no.loss"
))
# load cleaned status id's for tdcontrol general twitter search
tdcontrolids <-
data.frame(read.table(here("data", "raw", "public", "tdcontrolids.txt"),
numerals = "no.loss"
))
This operation may take over an hour to run on all of the raw tweets.
# rehydrate event tweets
tevent_raw <- rehydratoR(twitter_token$app$key, twitter_token$app$secret,
twitter_token$credentials$oauth_token,
twitter_token$credentials$oauth_secret, teventids,
base_path = NULL, group_start = 1
)
# rehydrate tdcontrol tweets
tdcontrol_raw <- rehydratoR(twitter_token$app$key, twitter_token$app$secret,
twitter_token$credentials$oauth_token,
twitter_token$credentials$oauth_secret, tdcontrolids,
base_path = NULL, group_start = 1
)
tevent_raw <- readRDS(here("data", "raw", "private", "tevent_raw.RDS"))
tevent_raw2 <- readRDS(here("data", "raw", "private", "tevent_raw2.RDS"))
tevent_raw3 <- readRDS(here("data", "raw", "private", "tevent_raw3.RDS"))
tevent_raw4 <- readRDS(here("data", "raw", "private", "tevent_raw4.RDS"))
tevent_raw <- rbind(tevent_raw, tevent_raw2, tevent_raw3, tevent_raw4) %>%
distinct(status_id, .keep_all=TRUE) # remove duplicates
rm(tevent_raw2, tevent_raw3, tevent_raw4)
count(tevent_raw, place_type)
## # A tibble: 6 × 2
## place_type n
## <chr> <int>
## 1 admin 3541
## 2 city 15454
## 3 country 25
## 4 neighborhood 41
## 5 poi 477
## 6 <NA> 355805
If you have loaded filtered status ids, or you have already run this code, you will not notice a difference in place_type or n because the data has already been processed.
First, convert coords_coords (from location services) to lat,lng coordinates. Second, filter only the records with location services-based lat and lng, or with a place designated at the city, neighborhood, or POI level. Third, convert the place (bbox_coords) to lat,lng coordinates; by default, this does not overwrite existing lat and lng values.
Caution: Do not use geo_coords, or lat/lng will be inverted!
# convert geographic information for event into lat,lng coordinates
tevent <- tevent_raw %>%
lat_lng(coords = c("coords_coords")) %>%
subset(place_type == "city" | place_type == "neighborhood" |
place_type == "poi" | !is.na(lat)
) %>%
lat_lng(coords = c("bbox_coords"))
# re-check counts of place types
count(tevent, place_type)
## # A tibble: 6 × 2
## place_type n
## <chr> <int>
## 1 admin 497
## 2 city 15454
## 3 country 1
## 4 neighborhood 41
## 5 poi 477
## 6 <NA> 10
Convert geographic information to lat,lng coordinates for the control tweets.
Warning: this code could only be run if a generic tweet search query were available; that data was not available for this iteration of the study.
# convert geographic information for control into lat,lng coordinates
tdcontrol <- tdcontrol_raw %>%
lat_lng(coords = c("coords_coords")) %>%
subset(place_type == "city" | place_type == "neighborhood" |
place_type == "poi" | !is.na(lat)
) %>%
lat_lng(coords = c("bbox_coords"))
Optionally, save the tweet ids to the data\derived\public folder as plain text. Save the full tweet data to the data\derived\private folder as RDS files. Full tweet data cannot be shared with the public, therefore it is stored in a folder ignored by Git.
# save event data
write.table(tevent$status_id,
here("data", "derived", "public", "teventids.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tevent, here("data", "derived", "private", "tevent_all.RDS"))
# save control data
write.table(tdcontrol$status_id,
here("data", "derived", "public", "tdcontrolids.txt"),
append = FALSE, quote = FALSE, row.names = FALSE, col.names = FALSE
)
saveRDS(tdcontrol, here("data", "derived", "private", "tdcontrol.RDS"))
Optionally, load the processed Twitter data.
tevent <- readRDS(here("data", "derived", "private", "tevent_all.RDS"))
Graph the number of tweets about Hurricane Ida per hour over the course of the hurricane.
tevent_tweets_by_hour <- ts_data(tevent, by = "hours")
ggplot()+
geom_line(data=tevent_tweets_by_hour, aes(x=time,y=n), color="black") +
xlab("Time") +
ylab("Number of tweets") +
labs(title="Hurricane Ida Tweets per Hour (2021)") +
theme_minimal()
# ggsave("tweets_by_hour.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
The first spike occurs on August 29, 2021, when the hurricane made landfall in Louisiana. The second largest spike occurs on September 3, 2021, as the hurricane petered out off the North Atlantic coast of the US.
Visualize network of retweets between Twitter users who tweeted about Hurricane Ida.
tevent_network <- tevent %>%
network_graph("retweet,quote") %>%
simplify(remove.multiple = FALSE)
tevent_network <- delete.vertices(
tevent_network,
degree(tevent_network, mode="in") < 1
) %>%
simplify()
tevent_network <- delete.vertices(
tevent_network,
degree(tevent_network) == 0
)
plot.igraph(
tevent_network,
vertex.size = 4,
vertex.label = ifelse(
degree(tevent_network) > 1,
V(tevent_network)$name, ""),
vertex.label.cex = 0.8,
edge.arrow.mode = "->",
edge.arrow.size = 0.2
)
Most of these Twitter users are news anchors, news outlets, or meteorologists. The graph labels users with more than one retweet connection.
Clean the tweet text into plain language, then parse the text into words. Remove stop words and the Hurricane Ida search terms.
# remove urls, fancy formatting, etc. in other words, clean the text content
tevent_text <- tevent %>%
select(text) %>%
plain_tweets()
# parse out words from tweet text
tevent_words <- tevent_text %>% unnest_tokens(word, text)
# how many words do you have including the stop words?
word_count <- list(before = count(tevent_words)$n)
# create list of stop words (useless words not worth analyzing)
data("stop_words")
# add "t.co" twitter links to the list of stop words
# also add the twitter search terms to the list
# it would have been better to store a list of search terms to use here
stop_words <- stop_words %>%
add_row(word = "t.co", lexicon = "SMART") %>%
add_row(word = "hurricane", lexicon = "Search") %>%
add_row(word = "ida", lexicon = "Search") %>%
add_row(word = "sharpiegate", lexicon = "Search") %>%
add_row(word = "hurricaneida", lexicon = "Search")
# delete stop words from tevent_words with an anti_join
tevent_words <- anti_join(tevent_words, stop_words, by="word")
# how many words after removing the stop words?
word_count <- append(
word_count,
list(after = count(tevent_words)$n)
)
print(word_count)
## $before
## [1] 398757
##
## $after
## [1] 190154
tevent_words %>%
count(word, sort = TRUE) %>%
slice_head(n = 15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(
x = "Count",
y = "Unique words",
title = "Count of unique words found in tweets"
) +
theme_minimal()
# separate words and count frequency of word pair occurrence in tweets
tevent_word_pairs <- tevent_text %>%
mutate(text = removeWords(tolower(text), stop_words$word)) %>%
unnest_tokens(paired_words, text, token = "ngrams", n = 2) %>%
separate(paired_words, c("word1", "word2"), sep = " ") %>%
count(word1, word2, sort = TRUE)
# graph a word cloud with space indicating association
tevent_word_pairs %>%
filter(n >= 30 & !is.na(word1) & !is.na(word2)) %>% # choose how many instances a word has to be included on graph
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8, size = 3, color = "darkorange3") +
labs(
title = "Word Network of Tweets during Hurricane Ida",
x = "", y = ""
) +
theme(
plot.background = element_rect(
fill = "grey95",
colour = "black",
size = 1
),
legend.background = element_rect(fill = "grey95")
)
Two main clusters emerge: one about utilities and the other about storm behavior. The utilities cluster has closely related words like power, outage, grid, and event. The storm cluster includes words like flooding, waters, insurance, and warning. A smaller cluster establishes links between the words “louisiana”, “orleans”, “makes”, and “landfall”. The smallest clusters of word pairs are often words that are used together in a phrase like “climate change”, “President Biden”, or “Grand Isle”.
ggplot() +
geom_sf(data = counties,
aes(fill = cut_number(DENSITY, 5)), color = "grey") +
scale_fill_brewer(palette = "GnBu") +
guides(fill = guide_legend(title = "Population Density")) +
geom_point(
data = tevent, aes(x = lng, y = lat),
colour = "purple", alpha = 0.1, size = 1
) +
geom_sf(data = search_buff, color = "red", fill = NA) +
labs(title = "Tweet Locations During Hurricane Ida") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# + xlim(-110, -67) +
# ylim(24, 47)
# ggsave("tweet_locations.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
Plotting the tweets reveals that the returned tweets do not fall neatly within the specified search radius. Twitter’s search methods are proprietary, so we cannot get more details about how the search radius is implemented.
Spatially join the tevent tweets to counties with the following steps:
- convert the tweets to spatial points using the lat and lng columns
- transform the points to the NAD 1983 geographic coordinate system
- join a county GEOID to each tweet by location
- count tweets grouped by county GEOID
tevent_sf <- tevent %>%
# optional temporal filter:
# filter(created_at > as.POSIXct("2021-08-30 00:00:00")) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>% # make point geometries
st_transform(4269) %>% # transform to NAD 1983
st_join(select(counties, GEOID)) # spatially join county GEOID to each tweet
tevent_by_county <- tevent_sf %>%
st_drop_geometry() %>% # drop geometry / make simple table
group_by(GEOID) %>% # group by county using GEOID
summarise(event_tweets = n()) # count # of tweets
counties <- counties %>%
left_join(tevent_by_county, by = "GEOID") %>%
mutate(event_tweets = replace_na(event_tweets, 0),
twitter_pop = POP/10000,
tweetrate = event_tweets / twitter_pop)
Calculate the normalized difference tweet index: (event - control) / (event + control)
counties <- counties %>%
mutate(ndti = (event_tweets - twitter_pop)/(event_tweets + twitter_pop)) %>%
mutate(ndti = replace_na(ndti, 0)) # replace NULLs with 0's
Save counties as counties_tweet_counts.RDS
# save counties geographic data with derived tweet rates
saveRDS(counties, here("data", "derived", "public", "counties_tweet_counts.RDS"))
Optionally, begin here by loading counties with Twitter data from counties_tweet_counts.RDS.
counties <- readRDS(here("data", "derived", "public", "counties_tweet_counts.RDS"))
ggplot() +
geom_sf(data = counties,
aes(fill = cut_width(ndti, width = 0.4)),
color = "grey", lwd=0.01) +
scale_fill_brewer(palette = "GnBu") +
guides(fill = guide_legend(title = "Normalized Difference Tweet Index")) +
labs(title = "Hurricane Ida Twitter Activity",
subtitle = "Using population / 10,000 as control") +
  theme_minimal() +
  # xlim(-105, -67) +
  # ylim(24, 47) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )
# ggsave("ndti_census.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
For the original Hurricane Dorian study, generic tweets were collected in November 2019 and used to normalize the hurricane tweets. I use this dataset to calculate the NDTI to see how different it is from the Census population results.
# make november data sf compatible
nov_by_county <- november %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(4269) %>%
st_join(select(counties, GEOID)) %>%
st_drop_geometry() %>%
group_by(GEOID) %>%
summarise(nov = n())
# join november data to counties dataset
counties <- counties %>%
left_join(nov_by_county, by = "GEOID")
# calculate NDTI
counties <- counties %>%
mutate(ndti_nov = (event_tweets - nov) / (event_tweets + nov)) %>% # normalized diff tweet index
mutate(ndti_nov = replace_na(ndti_nov, 0)) # replace NULLs with 0's
ggplot() +
geom_sf(data = counties,
aes(fill = cut_width(ndti_nov, width = 0.4)),
color = "grey", lwd=0.01) +
scale_fill_brewer(palette = "GnBu") +
guides(fill = guide_legend(title = "Normalized Difference Tweet Index")) +
labs(title = "Hurricane Ida Twitter Activity",
subtitle = "Using tweets in November 2019 as control") +
  xlim(-105, -67) +
  ylim(24, 47) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank()
  )
# ggsave("ndti_novtweets.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
The NDTI normalized by the actual November 2019 Twitter data has many more counties with an NDTI around 0. This suggests that these counties had low Twitter activity at this time, and that, compared to the Census control, there were fewer tweets in these counties than population / 10,000. Clustered areas of highest NDTI are southern Louisiana and Mississippi and the northeast megalopolis.
Use a 110 km geodesic distance threshold and include each county in its own neighborhood (self-inclusion) for the weight matrix.
county_coords <- counties %>%
st_centroid() %>% # convert polygons to centroid points
st_coordinates() # convert to simple x,y coordinates to play with stdep
thresdist <- county_coords %>%
dnearneigh(0, 110, longlat = TRUE) %>% # use geodesic distance of 110km
# distance should be long enough for every feature to have >= one neighbor
include.self() # include a county in its own neighborhood (for G*)
thresdist # view statistical summary of the nearest neighbors
## Neighbour list object:
## Number of regions: 2683
## Number of nonzero links: 66743
## Percentage nonzero weights: 0.9271807
## Average number of links: 24.87626
Optionally, plot the spatial weight matrix results. This should produce a very dense graph because each county is connected to all other counties within 110 km.
plot(thresdist, county_coords, lwd=0.1) # plot nearest neighbor ties
# Create weight matrix from the neighbor objects
dwm <- nb2listw(thresdist, zero.policy = T)
# Getis Ord G* statistic for hot and cold spots
counties$locG <- counties$tweetrate %>%
localG(listw = dwm, zero.policy = TRUE) %>%
as.vector()
# check summary statistics of the local G score
summary(counties$locG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.20949 -0.70955 -0.49303 -0.02718 -0.14245 21.53366
Classify G* scores by significance values typical of z-scores, where ±1.15 corresponds roughly to the 0.125 one-tailed level and ±1.95 to the 0.05 two-tailed level, following Getis and Ord (1992), doi: 10.1111/j.1538-4632.1992.tb00261.x.
To find other critical values, use the qnorm() function as shown here: https://methodenlehre.github.io/SGSCLM-R-course/statistical-distributions.html
Breaks and colors are from http://michaelminn.net/tutorials/r-point-analysis/ and are based on 1.96 as the 95% confidence threshold for z-scores.
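For reference, these cutoffs can be reproduced with qnorm(); the following is a minimal base-R illustration, not part of the original analysis.
# illustrative check of the z-score cutoffs used below
qnorm(1 - 0.125)    # ~1.15, the 0.125 one-tailed level
qnorm(1 - 0.05 / 2) # ~1.96, the 0.05 two-tailed level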
# classify by significance levels
siglevel <- c(1.15, 1.95)
counties <- counties %>%
mutate(sig = cut(locG, c(
min(counties$locG),
siglevel[2] * -1,
siglevel[1] * -1,
siglevel[1],
siglevel[2],
max(counties$locG)
)))
# map results
ggplot() +
geom_sf(data = counties, aes(fill = sig), color = "white", lwd = 0.1) +
scale_fill_manual(
values = c("#0000FF80", "#8080FF80", "#FFFFFF80", "#FF808080", "#FF000080"),
labels = c("low", "", "insignificant", "", "high"),
aesthetics = "fill"
) +
guides(fill = guide_legend(title = "Hot Spots")) +
labs(title = "Clusters of Hurricane Ida Twitter Activity") +
theme(
plot.title = element_text(hjust = 0.5),
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
# ggsave("tweet_hotspots.jpg", plot = last_plot(), path = here("results", "figures"), device = "jpeg")
The temporal analysis of tweets per hour over the course of the hurricane revealed the highest activity occurred on August 29, 2021. This coincides with the hurricane’s peak intensity as it made landfall near Grand Isle, Louisiana. The second highest tweet frequency occurred on September 3, shortly before the hurricane dissipated in the North Atlantic.
The network analysis of retweets of Hurricane Ida tweets revealed that news anchors and analysts (@Elisa_Raffa, @DrRobertCollins), news outlets (the National Hurricane Center @NHC_Atlantic, the Louisiana CBS affiliate WWL-TV @WWLTV), and meteorologists/weather enthusiasts (@skydrama, @StormTrackBrian) were most frequently retweeted.
Two main clusters emerge: one about utilities and the other about storm behavior. The utilities cluster has closely related words like power, outage, grid, and event. The storm cluster includes words like flooding, waters, insurance, and warning. A smaller cluster establishes links between the words “louisiana”, “orleans”, “makes”, and “landfall”. The smallest clusters of word pairs are often words that are used together in a phrase like “climate change”, “President Biden”, and “Grand Isle”. The high usage of many Louisiana-related words (Grand Isle, New Orleans) may have occurred during the period of highest Twitter activity, when the hurricane passed through Louisiana.
Both NDTI calculations revealed high Twitter activity along the Louisiana and Mississippi coast and in the greater New York metropolitan area. The main difference between the two normalization methods is the most common NDTI value for counties outside the high-activity areas. Using the Census population NDTI, most counties have an NDTI of -1 to -0.6, whereas using the November 2019 generic tweets NDTI, most counties have an NDTI of -0.2 to 0.2.
Four major hotspots emerged on the map generated by Getis Ord G*: Louisiana & Mississippi, New York metropolitan area, Iowa, and western Kansas.
The plot of Hurricane Ida tweet locations reveals that many tweets in the data set are outside of the specified radius for the search query. This visual alludes to the uncertainty of the Twitter data quality. Information about Twitter’s methods is proprietary, so we cannot determine how this geographic search is conducted. Very few of the tweets had exact coordinates, so most of the coordinates associated with tweets in the data are the centroid of a city or neighborhood. It is also unclear how accurate the tweet locations Twitter provides are. This is all to say that, before even discussing the spatial analysis conducted in this study, the geographic data contains a lot of uncertainty.
In the absence of a generic Twitter search query to normalize the hurricane-related tweets, I used the county population divided by 10,000. Comparing this NDTI to the November 2019 tweets normalization and the hotspot maps, all three maps identify clusters of high Twitter activity about Hurricane Ida in Louisiana and the Mississippi coast and in the New York metropolitan area. Both of these areas were heavily impacted by the hurricane. The hurricane made landfall on the coast of Louisiana, and New York City experienced flooding due to rain that broke the city’s record for most single-hour rainfall.
The November 2019 search had a different geographic area than the Hurricane Ida tweet searches, so there are no tweets for counties on the westernmost part of the map. While the exact NDTI values are not valid, plotting this map shows that population / 10,000 can be used in place of a generic tweet baseline and still yield similar results.
While developing this study, I used different spatial extents for the counties (such as including all counties of states that intersected the search area, so counties outside the alleged search radius were included). Changing the spatial extent changed the results of the Getis Ord G* hotspot map. When I included more counties with no or low Hurricane Ida Twitter activity (i.e. those near the western edge of the search radius, furthest from the storm), the lowest-activity hotspots dropped off the map. Adding these counties lowers the average hurricane-related activity, so low-activity areas become more common and the lowest-activity clusters become less salient. For this study, I was most interested in areas of highest Twitter activity, and these hotspots did not change with changes in spatial extent. Areas around the edge of the study area are also “less connected” in the Getis Ord G* spatial weight matrix because they are surrounded by fewer counties. This did not seem to affect the study results because the edges of the study area were over the ocean or far from the storm track (the main area of interest).
There are many additions that could be made to this study to enhance the geographic validity of the study. The following are changes I would have liked to implement with more time:
“Stem” words: The word frequency analysis revealed the related words “flood” and “flooding”. These words have essentially the same meaning, so “stemming” the words in the data set could yield better insight into the topics of tweets. The quanteda text analysis package has a word-stemming function (see the sketch after this list). Additionally, removing numbers and times (“1”, “4”, and “pm” made the top word list) could reveal more about tweet word usage.
Visualizing hotspot neighborhoods: visualize the Getis Ord G* neighborhood around the highest/lowest activity clusters on the hotspot map.
Population by age: Using population / 10,000 may not be the best way to estimate tweet rate because the age of Twitter users is not evenly distributed. US Twitter users are predominantly under the age of 50, so a subset of county population by age (available in the Census 5-year ACS used in this study) could be used instead.
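As a sketch of the stemming idea above, the following illustrative snippet (not part of the original analysis) uses the SnowballC package as an alternative to quanteda’s word-stemming function and also drops purely numeric tokens; it assumes the tevent_words table from the text analysis above.
# illustrative sketch: stem words and drop numeric tokens before counting
# assumes tevent_words exists; SnowballC is an additional dependency
library(SnowballC)
tevent_stems <- tevent_words %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%               # drop numeric tokens like "1" and "4"
  mutate(word = wordStem(word, language = "english")) %>% # "flood" and "flooding" both become "flood"
  count(word, sort = TRUE)
The stemmed counts could then feed the same bar chart and word network shown earlier.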
I successfully replicated the Holler (2019) study of Hurricane Dorian Twitter activity for Hurricane Ida using the code from the original study. Changes were made to the original code to use population to calculate tweet rate and improve visualizations.
The original study of Hurricane Dorian was motivated by the Sharpiegate controversy in which President Trump mistakenly insisted on Twitter that Hurricane Dorian would hit Alabama, and inter-agency chaos ensued. It is difficult to assess the validity of information from social media such as Twitter, especially when authority figures are spreading the information.
The implications of this replication study are difficult to assert due to the loss of the free Twitter API access required to replicate this study with different search terms or a different search area. Analysis of Twitter activity during storms can reveal how people are adapting their collective response to ever-changing storm behavior using technology.
From an open-science perspective, the changes to Twitter’s API availability decidedly run counter to open-science practices by limiting what data can be accessed and who can access it. Despite implementing open-science practices in this study design, this study cannot be fully open because of its data source. This exemplifies the challenge of implementing open science when many of the resources employed in science are not yet openly available.
I collaborated with Andy Cao on this project. See his project on his GitHub site for a different take on this replication.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI: 10.17605/OSF.IO/W29MQ (https://doi.org/10.17605/OSF.IO/W29MQ).