Scrape AJAX Websites with Selenium
Selenium is a great tool for obtaining data from websites. However, scraping websites built as Single Page Applications (SPAs), which typically load their content through AJAX calls, can be tricky. In this article, I will explain how to scrape such websites with an example: a song scraper for MP3Juices.
Firstly, make sure you have Selenium installed. Import the libraries used in this project:
import time
import os
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
Next, set the options for the Chrome Driver:
prefs = {"download.default_directory":path}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("prefs",prefs)
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument('--disable-dev-shm-usage')
Here, path is a string containing the directory where you want the downloaded songs to be stored.
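The options snippet uses path without defining it. As a minimal sketch, the helper below (build_prefs is a hypothetical name, not part of Selenium) creates the download directory if needed and returns a prefs dictionary containing its absolute path. The download.prompt_for_download entry is an assumption added for illustration; the original code sets only download.default_directory.

```python
import os


def build_prefs(download_dir):
    """Create download_dir if needed and return Chrome prefs pointing at it."""
    # Resolve to an absolute path so the setting does not depend on the
    # directory the script happens to be started from.
    absolute_dir = os.path.abspath(download_dir)
    os.makedirs(absolute_dir, exist_ok=True)
    return {
        "download.default_directory": absolute_dir,
        # Assumption: skip the "Save as" dialog so downloads start unattended.
        "download.prompt_for_download": False,
    }
```

With this helper, prefs = build_prefs(path) could replace the first line of the options snippet, where path might be something like a Songs folder inside your home directory (a hypothetical location).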
loc=r'LOCATION_OF_YOUR_CHROMEDRIVER'
driver= webdriver.Chrome(executable_path=loc,options=chrome_options)
driver.get('https://www.mp3juices.cc/')
songs=["bohemian rhapsody","castle on the hill","everything i wanted"]
Here, we use the driver to load the website. songs is a list containing the names of the songs you want to download automatically. To download a large number of songs, an alternative is to store their names in a txt file and read that file in your Python script.
To locate the data we want to scrape, we have to understand the DOM structure of the website and the name/id of the elements involved. The browser's Inspect Element tool (CTRL+SHIFT+I) lets us do so.
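The txt-file alternative could look roughly like the sketch below. The function name load_songs and the one-song-per-line file layout are assumptions for illustration, not part of the original project.

```python
def load_songs(filename):
    """Return the song names stored one per line in a text file."""
    with open(filename, encoding="utf-8") as handle:
        # Strip surrounding whitespace and skip blank lines.
        return [line.strip() for line in handle if line.strip()]
```

The hard-coded list could then be replaced with songs = load_songs("songs.txt"), where songs.txt is a hypothetical file name.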
wait = WebDriverWait(driver, 10)
for song in songs:
    search_box = driver.find_element(By.NAME, 'query')
    search_box.clear()
    search_box.send_keys(song+' '+'lyrics')
    search_box.submit()
    wait.until(lambda driver: driver.execute_script("return jQuery.active == 0"))
We first have to search for the song in the search bar provided. Inspecting the page's HTML shows that the name of the search bar element is ‘query’. find_element with By.NAME locates it, and send_keys ‘types’ the name of the song (followed by ‘lyrics’) into the search bar.
wait.until(lambda driver: driver.execute_script("return jQuery.active == 0"))
jQuery.active is an internal jQuery variable that holds the number of AJAX requests currently in flight. This line therefore checks whether the downloadable MP3s have finished loading. If they have not, WebDriverWait checks again every 0.5 seconds by default, and raises a TimeoutException if the condition still does not hold after the 10 seconds configured above. This prevents the driver from trying to retrieve elements that are not yet loaded and running into errors.
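To make the polling behaviour concrete, here is a hand-rolled sketch of roughly what wait.until does with this condition. It is not Selenium's implementation: wait_for_ajax is a hypothetical helper, it raises Python's built-in TimeoutError rather than Selenium's TimeoutException, and it assumes only that driver has an execute_script method and that the page loads jQuery.

```python
import time


def wait_for_ajax(driver, timeout=10.0, poll_interval=0.5):
    """Poll jQuery.active until no AJAX requests are in flight."""
    deadline = time.monotonic() + timeout
    while True:
        # Ask the page whether jQuery still has pending AJAX requests.
        if driver.execute_script("return jQuery.active == 0"):
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("AJAX requests still pending after %.1f s" % timeout)
        # Sleep briefly before asking again.
        time.sleep(poll_interval)
```

In the scraper itself, the WebDriverWait version is the better choice; the sketch only illustrates the check, sleep, and retry cycle behind it.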
(STILL IN THE FOR LOOP)
    button = driver.find_element(By.LINK_TEXT, 'Download')
    driver.execute_script("arguments[0].click();", button)
    wait.until(lambda driver: driver.execute_script("return jQuery.active == 0"))
After the MP3s are loaded, we have to click the white ‘Download’ button. We can use execute_script, which executes the JavaScript code passed in as an argument, to ‘click’ the button. Again, the resulting AJAX calls take time to complete, so we use the same mechanism described above to ensure that the new elements are loaded before we carry out further steps.
    url = driver.find_element(By.CLASS_NAME, 'url')
    driver.execute_script("arguments[0].click();", url)
After clicking the white button, an <a> tag (the black ‘Download’ button) should be loaded, and we now have to follow the URL in it to download the MP3.
For the driver to quit only after all downloads have completed, we have to add a function:
def download_checker():
    # Chrome keeps a temporary .crdownload file for every unfinished download,
    # so poll the download directory until none are left.
    while any(name.endswith(".crdownload") for name in os.listdir(path)):
        time.sleep(0.1)
and call it after all downloads have started:
(OUTSIDE THE FOR LOOP)
download_checker()
driver.quit()
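The checker above has no upper bound on how long it waits, so a stalled download would keep the script from reaching driver.quit(). As a minimal sketch of one way around that, the variant below gives up after a timeout. The name wait_for_downloads and its timeout and poll_interval parameters are hypothetical, not part of the original project.

```python
import os
import time


def wait_for_downloads(directory, timeout=300.0, poll_interval=0.1):
    """Return True once no .crdownload files remain, or False on timeout."""
    deadline = time.monotonic() + timeout
    # Chrome names unfinished downloads with a .crdownload suffix.
    while any(name.endswith(".crdownload") for name in os.listdir(directory)):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Calling wait_for_downloads(path) before driver.quit() would then return False instead of blocking forever when a download never finishes, which lets the script report the problem and still close the browser.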
That’s all! All the songs in the list songs will be in the directory you defined at the very beginning. Of course, you might not be able to download songs that are not well known, as the ranking algorithm of MP3Juices may rank them lower. This project is quite useful when you want to automatically download a list of songs, which can be a pain to do manually.
The code above can be found in this GitHub repository.