It’s All Math - Learning webscraping

I’ve always wanted to scrape the metalstorm.net top albums list, since it seems to be a more accurate measure of how the mainstream metal community ranks albums. The Rateyourmusic (RYM) top metal albums list is skewed towards the early days of the genre: RYM has 3 Black Sabbath albums in its top 5, and in my opinion these are very weak albums compared to newer releases.

library(rvest)
library(tidyverse)

link <- "https://metalstorm.net/bands/albums_top.php"
page <- read_html(link)

band <- page %>%
  html_nodes('b a') %>%
  html_text()

album <- page %>%
  html_nodes('td.hidden-xs a') %>%
  html_text()

release_year <- page %>%
  html_nodes('td.hidden-xs .dark') %>%
  html_text()

score <- page %>%
  html_nodes('.rating-cell a') %>%
  html_text()

votes <- page %>%
  html_nodes('.rating-cell .dark') %>%
  html_text()

After scraping the albums I need to clean the results up a bit. Both score and votes contain every result twice, and votes has also picked up a bit of junk. First let’s look at score and see if there is an easy solution.

head(score)

[1] "9.29" "9.29" "9.28" "9.28" "9.28" "9.28"

The entries are all duplicated, but 2 albums could have the same score so using a simple filter like unique won’t be the right approach. The easy method here is to just select every other entry.

clean_score <- score[seq(1, length(score), 2)]
head(clean_score)

[1] "9.29" "9.28" "9.28" "9.26" "9.25" "9.25"
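
A quick check on a made-up vector shows why unique would lose data here. The toy scores below are hypothetical; they are just laid out the way the scraper returns them, with each value appearing twice and two different albums tied at 9.28.

```r
# Hypothetical scores in the same shape the scraper returns: every value
# appears twice, and two different albums are tied at 9.28.
toy_score <- c("9.29", "9.29", "9.28", "9.28", "9.28", "9.28", "9.26", "9.26")

# unique() collapses the tie, so one album silently disappears.
by_unique <- unique(toy_score)

# Taking every other element keeps one entry per album, ties included.
by_position <- toy_score[seq(1, length(toy_score), by = 2)]

length(by_unique)    # 3 values left, one album lost
length(by_position)  # 4 values, one per album
```

The positional approach only works because the duplicates arrive in strict pairs, which is what the output above suggests for this page.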

Cleaning up votes (and years) will be slightly harder, since the strings contain extra characters mixed in with the numbers.

head(votes)

[1] "| 3172" "3172"   "| 2190" "2190"   "| 2372" "2372"

Using gsub I’ll strip out every non-numeric character, and then I’ll use the same trick as before to get every other element. The release years are not duplicated, so they only need the gsub step.

numeric_votes <- gsub("[^0-9]", "", votes)
clean_votes <- numeric_votes[seq(1,length(numeric_votes), 2)]
clean_release_years <- gsub("[^0-9]", "", release_year)
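
Here is what those two steps do to the six vote strings printed above. This is only a sketch on the sample values; sample_votes, digits_only and once_each are names made up for the illustration.

```r
# The six raw vote strings shown by head(votes) above.
sample_votes <- c("| 3172", "3172", "| 2190", "2190", "| 2372", "2372")

# Replace every character that is not a digit with an empty string.
digits_only <- gsub("[^0-9]", "", sample_votes)

# Keep positions 1, 3, 5 so each album is counted once.
once_each <- digits_only[seq(1, length(digits_only), by = 2)]

as.numeric(once_each)
# [1] 3172 2190 2372
```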

Now to put it all into one data frame.

top_200_albums <- data.frame(band_names = band,
                             album_names = album,
                             release_years = as.numeric(clean_release_years),
                             album_scores = as.numeric(clean_score),
                             album_votes = as.numeric(clean_votes)
                             )
head(top_200_albums)

    band_names        album_names release_years album_scores album_votes
1     Megadeth      Rust In Peace          1990         9.29        3172
2 Judas Priest         Painkiller          1990         9.28        2190
3        Death           Symbolic          1995         9.28        2372
4        Opeth    Blackwater Park          2001         9.26        3178
5    Metallica Ride The Lightning          1984         9.25        3586
6    Metallica  Master Of Puppets          1986         9.25        3790
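
One thing worth guarding against: data.frame will fail if the scraped vectors have incompatible lengths, and it will quietly recycle a shorter vector when the lengths happen to divide evenly. A small check before building the data frame can catch a selector that grabbed too much or too little. The helper below is a hypothetical sketch, not part of rvest or the tidyverse, and the toy vectors stand in for the scraped columns.

```r
# Hypothetical helper: stop early if the supplied vectors differ in length.
check_same_length <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  if (length(unique(lens)) != 1) {
    stop("Column lengths differ: ",
         paste(names(lens), lens, sep = " = ", collapse = ", "))
  }
  invisible(TRUE)
}

# Toy vectors standing in for the scraped columns.
demo_band  <- c("Megadeth", "Judas Priest", "Death")
demo_album <- c("Rust In Peace", "Painkiller", "Symbolic")
demo_score <- c(9.29, 9.28, 9.28)

# Passes silently because all three vectors have length 3.
check_same_length(band = demo_band, album = demo_album, score = demo_score)
```

With the real data the call would take band, album, clean_release_years, clean_score and clean_votes, assuming those are the vectors built above.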

Trying out Plotly since I’ve never used it before and people say it’s a pretty nice package for interactive graphs.

library(plotly)

plot <- plot_ly(top_200_albums, x = ~album_scores, y = ~album_votes,
                text = ~paste("Band: ", band_names,
                              "\nAlbum: ", album_names,
                              "\nScore: ", album_scores,
                              "\nVotes:", album_votes),
                hoverinfo = "text",
                type = "scatter", mode = "markers")

plot <- plot %>% layout(title = "Top 200 Albums",
                        xaxis = list(title = "Album Score"),
                        yaxis = list(title = "Album Votes"))
plot

And there you have it. My first webscraping project. I’ve wondered if there is a correlation between how popular a metal band is on Metalstorm and the band winning an award in the community album of the year votes that come around every February. Maybe my next webscraping project will be investigating this.