library(rvest)
library(tidyverse)
link <- "https://metalstorm.net/bands/albums_top.php"
page <- read_html(link)
band <- page %>%
  html_nodes('b a') %>%
  html_text()
album <- page %>%
  html_nodes('td.hidden-xs a') %>%
  html_text()
release_year <- page %>%
  html_nodes('td.hidden-xs .dark') %>%
  html_text()
score <- page %>%
  html_nodes('.rating-cell a') %>%
  html_text()
votes <- page %>%
  html_nodes('.rating-cell .dark') %>%
  html_text()
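To make the selector logic easier to follow without hitting the live site, here is a minimal offline sketch. The HTML fragment is made up for illustration and is only my guess at the general shape of a table row; it is not copied from metalstorm.net. The names `toy_page`, `band_demo`, `album_demo` and `year_demo` are hypothetical.

```r
library(rvest)

# Made-up markup for illustration only (an assumption, not the real page source)
toy_page <- minimal_html('
  <table>
    <tr>
      <td><b><a href="#">Megadeth</a></b></td>
      <td class="hidden-xs"><a href="#">Rust In Peace</a> <span class="dark">1990</span></td>
    </tr>
    <tr>
      <td><b><a href="#">Judas Priest</a></b></td>
      <td class="hidden-xs"><a href="#">Painkiller</a> <span class="dark">1990</span></td>
    </tr>
  </table>
')

# "b a" matches links nested inside bold tags
band_demo <- toy_page %>% html_nodes("b a") %>% html_text()

# "td.hidden-xs a" matches links inside cells with class hidden-xs
album_demo <- toy_page %>% html_nodes("td.hidden-xs a") %>% html_text()

# "td.hidden-xs .dark" matches any element with class dark inside those cells
year_demo <- toy_page %>% html_nodes("td.hidden-xs .dark") %>% html_text()

print(band_demo)
print(album_demo)
print(year_demo)
```

Each selector returns one character vector, in document order, which is why the vectors can later be lined up side by side as columns.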
I’ve always wanted to scrape the metalstorm.net top albums list, since it seems to be a more accurate measure of how the mainstream metal community ranks albums. The Rate Your Music (RYM) top metal albums chart is skewed towards the early days of the genre: RYM has 3 Black Sabbath albums in its top 5, and in my opinion these are very weak albums compared to newer releases.
After scraping the albums I need to clean them up a bit. Both score and votes contain every result twice, and the selectors have also picked up a bit of junk. First let’s look at score and see if there is an easy solution.
head(score)
[1] "9.29" "9.29" "9.28" "9.28" "9.28" "9.28"
The entries are all duplicated, but 2 albums could have the same score so using a simple filter like unique won’t be the right approach. The easy method here is to just select every other entry.
clean_score <- score[seq(1, length(score), 2)]
head(clean_score)
[1] "9.29" "9.28" "9.28" "9.26" "9.25" "9.25"
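To see concretely why unique would be the wrong tool, here is a small sketch using only the six values printed by head(score) above. The names `score_demo`, `unique_scores` and `every_other` are made up for this example.

```r
# The six values shown by head(score) above
score_demo <- c("9.29", "9.29", "9.28", "9.28", "9.28", "9.28")

# unique() also collapses the genuine tie between two different albums at 9.28
unique_scores <- unique(score_demo)

# Keeping positions 1, 3, 5 drops only the duplicate of each pair
every_other <- score_demo[seq(1, length(score_demo), by = 2)]

print(unique_scores)
print(every_other)
```

With unique the three albums shrink to two scores, while the every-other approach keeps one score per album.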
Cleaning up votes (and years) will be slightly harder since it has a mixture of things in it.
head(votes)
[1] "| 3172" "3172" "| 2190" "2190" "| 2372" "2372"
Using gsub I’ll strip out every character that isn’t a digit, and then use the same trick as before to keep every other element.
numeric_votes <- gsub("[^0-9]", "", votes)
clean_votes <- numeric_votes[seq(1, length(numeric_votes), 2)]
clean_release_years <- gsub("[^0-9]", "", release_year)
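Dropping every other element quietly assumes the entries really do come in matching pairs. A minimal sketch of how that assumption could be checked first, using the six values from head(votes) above; the names `votes_demo`, `digits_only`, `odd_entries`, `even_entries` and `votes_clean_demo` are hypothetical.

```r
# The six values shown by head(votes) above
votes_demo <- c("| 3172", "3172", "| 2190", "2190", "| 2372", "2372")

# Remove everything that is not a digit, e.g. "| 3172" becomes "3172"
digits_only <- gsub("[^0-9]", "", votes_demo)

# Split into the first and second member of each pair
odd_entries  <- digits_only[seq(1, length(digits_only), by = 2)]
even_entries <- digits_only[seq(2, length(digits_only), by = 2)]

# Stop early if the pairs do not line up as expected
stopifnot(length(odd_entries) == length(even_entries),
          all(odd_entries == even_entries))

votes_clean_demo <- as.numeric(odd_entries)
print(votes_clean_demo)
```

If the site ever changed its markup so that entries stopped being duplicated, this check would fail loudly instead of silently discarding half of the real data.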
Now to put it all into one data frame.
top_200_albums <- data.frame(band_names = band,
                             album_names = album,
                             release_years = as.numeric(clean_release_years),
                             album_scores = as.numeric(clean_score),
                             album_votes = as.numeric(clean_votes)
)
head(top_200_albums)
    band_names        album_names release_years album_scores album_votes
1     Megadeth      Rust In Peace          1990         9.29        3172
2 Judas Priest         Painkiller          1990         9.28        2190
3        Death           Symbolic          1995         9.28        2372
4        Opeth    Blackwater Park          2001         9.26        3178
5    Metallica Ride The Lightning          1984         9.25        3586
6    Metallica  Master Of Puppets          1986         9.25        3790
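One thing worth guarding against when combining scraped vectors: data.frame recycles shorter atomic vectors when the longer length is a whole multiple of the shorter one, so a column that was not de-duplicated could end up misaligned without any error. Below is a minimal sketch of a length check; `check_same_length` is a hypothetical helper I made up for this example, and the toy vectors are not the real scraped data.

```r
# Hypothetical helper: stop if the named vectors differ in length
check_same_length <- function(...) {
  lengths_found <- lengths(list(...))
  if (length(unique(lengths_found)) != 1) {
    stop("Column lengths differ: ",
         paste(names(lengths_found), lengths_found, sep = " = ", collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data: two bands, but scores that were not yet de-duplicated
band_demo  <- c("Megadeth", "Judas Priest")
score_demo <- c("9.29", "9.29", "9.28", "9.28")

# data.frame() recycles band_demo here instead of raising an error
recycled <- data.frame(band_names = band_demo, album_scores = score_demo)
print(nrow(recycled))

# The helper turns the same situation into an explicit error message
result <- tryCatch(
  check_same_length(band = band_demo, score = score_demo),
  error = function(e) conditionMessage(e)
)
print(result)
```

Calling the helper on the five cleaned vectors just before data.frame would be one way to confirm that they all have the same number of entries.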
I’m trying out Plotly here since I’ve never used it before, and people say it’s a pretty nice package for interactive graphs.
library(plotly)
plot <- plot_ly(top_200_albums, x = ~album_scores, y = ~album_votes,
                text = ~paste("Band: ", band_names, "\nAlbum: ", album_names, "\nScore: ", album_scores, "\nVotes:", album_votes),
                hoverinfo = "text",
                type = "scatter", mode = "markers")
plot <- plot %>% layout(title = "Top 200 Albums",
                        xaxis = list(title = "Album Score"),
                        yaxis = list(title = "Album Votes"))
plot
And there you have it: my first web scraping project. I’ve wondered whether there is a correlation between how popular a metal band is on Metalstorm and the band winning an award in the community album of the year votes that come around every February. Maybe my next web scraping project will investigate this.