Introduction
Extracting video URLs, image URLs, and text from a webpage is straightforward with Selenium and Python's Beautiful Soup. If a page uses URLs like "https://…video.mp4" as the src, we can access those videos directly.
However, many websites use blob-format URLs such as src="blob:https://video_url". We can extract them with Selenium and bs4, but we can't access them directly because they are generated internally by the browser.
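As a rough illustration of the difference, the sketch below sorts the src values found in a page into direct URLs and blob URLs. It uses only the standard library's HTMLParser instead of bs4 so that it runs without extra packages; the class name, function name, and sample HTML are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class VideoSrcCollector(HTMLParser):
    """Collect the src attribute of every <video> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def split_sources(html):
    """Return (direct_urls, blob_urls) found in the given HTML string."""
    collector = VideoSrcCollector()
    collector.feed(html)
    direct, blob = [], []
    for src in collector.sources:
        # A blob URL has "blob" as its scheme, e.g. blob:https://...
        if urlparse(src).scheme == "blob":
            blob.append(src)
        else:
            direct.append(src)
    return direct, blob


# Hypothetical page fragment, not taken from a real site
sample_html = """
<video src="blob:https://example.com/3f2a"></video>
<video><source src="https://example.com/clip.mp4"></video>
"""
print(split_sources(sample_html))
```

Anything that lands in the second list cannot be fetched from outside the browser session that created it, which is why the methods described later are needed.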
What are BLOB URLs?
Blob URLs can only be generated by the browser. URL.createObjectURL() creates a special reference to a Blob or File object, which can later be released using URL.revokeObjectURL(). These URLs can only be used locally, within a single browser instance and the same session.
Blob URLs are typically used to display or play multimedia content, such as videos, directly in a web browser or media player without saving the content as a file on the user's device. They are often used with HTML5 video elements, allowing web developers to embed video content directly into a web page.
To overcome this problem, we found two methods that can help you extract the video URL directly:
- yt-dlp
- Selenium + Network Logs
yt-dlp
yt-dlp is a handy command-line tool for downloading YouTube videos and extracting other attributes of those videos, such as titles, descriptions, and tags. It also supports many other sites. Below are the steps to use it, along with sample code.
Install yt-dlp on Ubuntu:
sudo snap install yt-dlp
Below is a simple example that extracts video URLs by running yt-dlp in a Python subprocess. We use options such as -f, -g, and -q; descriptions of these options can be found in the yt-dlp GitHub repository.
import subprocess


def get_video_urls(url):
    videos_url = []
    # -f all: consider every format, -g: print the URLs only, -q: quiet mode
    youtube_subprocess = subprocess.Popen(
        ["yt-dlp", "-f", "all", "-g", "-q", "--ignore-errors", "--no-warnings", url],
        stdout=subprocess.PIPE)
    try:
        video_url_list = youtube_subprocess.communicate(timeout=15)[0].decode("utf-8").splitlines()
        # Prefer direct media files
        for video in video_url_list:
            if video.endswith((".mp4", ".mp3", ".mov", ".webm")):
                videos_url.append(video)
        # Fall back to streaming playlists if no direct file was found
        if len(videos_url) == 0:
            for video in video_url_list:
                if video.endswith(".m3u8"):
                    videos_url.append(video)
    except subprocess.TimeoutExpired:
        youtube_subprocess.kill()
        youtube_subprocess.communicate()  # reap the killed process
    return videos_url


print(get_video_urls(url="https://edition.cnn.com/videos/world/2022/12/06/china-beijing-covid-restrictions-wang-dnt-ebof-vpx.cnn"))
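One caveat with the endswith checks above is that a URL carrying a query string (for example a signed token) no longer ends with the file extension. A possible variation, sketched below under that assumption, compares the extension of the URL path instead. The helper names and the example.com URLs are hypothetical and not part of yt-dlp.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

DIRECT_EXTENSIONS = {".mp4", ".mp3", ".mov", ".webm"}
PLAYLIST_EXTENSION = ".m3u8"


def url_extension(url):
    """Return the lowercase extension of the URL path, ignoring query and fragment."""
    return PurePosixPath(urlparse(url).path).suffix.lower()


def pick_media_urls(candidates):
    """Prefer direct media files; fall back to .m3u8 playlists if none are found."""
    direct = [u for u in candidates if url_extension(u) in DIRECT_EXTENSIONS]
    if direct:
        return direct
    return [u for u in candidates if url_extension(u) == PLAYLIST_EXTENSION]


# Hypothetical URLs shaped like the lines yt-dlp -g prints
candidates = [
    "https://example.com/video/master.m3u8?token=abc",
    "https://example.com/video/clip.mp4?token=abc",
    "",
]
print(pick_media_urls(candidates))
```

The preference order is the same as in get_video_urls: direct files first, playlists only as a fallback.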
Selenium + Network Logs
When a website uses blob URLs and a video is playing, we can access the streaming URL (.m3u8) of that video in the browser’s network tab. We can use network and performance logs to find stream URLs.
What is M3U8?
M3U8 is a UTF-8 encoded text file that identifies the location of one or more media files. It is commonly used to define a playlist of audio or video files for streaming over the Internet with a media player that supports the format, such as VLC, Apple's iTunes, and QuickTime. The file usually has the ".m3u8" extension and begins with an #EXTM3U header line, followed by tag lines (starting with #) that carry attribute information and by lines that reference media files. Each entry typically specifies a single media segment along with its duration and an optional title, or a reference to another M3U8 file for a playlist of streams.
We can extract network and performance logs using Selenium with some advanced options. Install the required packages:
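To make the format more concrete, here is a minimal sketch that reads a small, made-up master playlist and lists the URIs it references. The sample playlist, the base URL, and the function name are assumptions for illustration; real playlists can contain many more tags than this handles.

```python
from urllib.parse import urljoin

# Hypothetical master playlist with two quality variants
SAMPLE_MASTER_PLAYLIST = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2400000,RESOLUTION=1280x720
high/index.m3u8
"""


def list_playlist_uris(playlist_text, base_url):
    """Return absolute URIs for every non-tag line of an M3U8 playlist."""
    lines = [line.strip() for line in playlist_text.splitlines()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an extended M3U playlist")
    # Lines starting with '#' are tags or comments; the rest are URIs,
    # which may be relative to the playlist's own URL.
    return [urljoin(base_url, line) for line in lines if line and not line.startswith("#")]


print(list_playlist_uris(SAMPLE_MASTER_PLAYLIST, "https://example.com/video/master.m3u8"))
```

In this sample each URI points to another M3U8 file, one per quality level; those media playlists would in turn list the individual segments.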
pip install selenium
pip install webdriver_manager
Below is an example of getting a streaming URL (.m3u8) using Selenium and network logs:
import json
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
# Enable performance logging so that network events are recorded.
# This is set on the options object because recent Selenium 4 releases
# no longer accept a desired_capabilities argument.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("start-maximized")
# Let videos start playing without a user click
options.add_argument("--autoplay-policy=no-user-gesture-required")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--mute-audio")
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=options)


def get_m3u8_urls(url):
    driver.get(url)
    driver.execute_script("window.scrollTo(0, 10000)")
    # Give the player time to load and request the stream
    time.sleep(20)
    logs = driver.get_log("performance")
    url_list = []
    for log in logs:
        network_log = json.loads(log["message"])["message"]
        method = network_log["method"]
        if ("Network.response" in method
                or "Network.request" in method
                or "Network.webSocket" in method):
            request_url = network_log["params"].get("request", {}).get("url", "")
            # Keep stream playlists and skip browser-internal blob URLs
            if ".m3u8" in request_url and "blob" not in request_url:
                url_list.append(request_url)
    driver.quit()
    return url_list


if __name__ == "__main__":
    url = "https://fruitlab.com/video/aTUqTrJrMtj6FgO5?ntp=ggm"
    url_list = get_m3u8_urls(url)
    print(url_list)
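The log-filtering step can also be tried without launching a browser. The sketch below is one possible variation, not the code above: it additionally looks at the URL under "response" (where response events carry it) and removes duplicates. The sample entries are hand-built and simplified; real entries returned by driver.get_log("performance") contain more fields, and the helper names are hypothetical.

```python
import json


def m3u8_urls_from_logs(logs):
    """Extract unique .m3u8 URLs from Chrome performance log entries."""
    urls = []
    for entry in logs:
        message = json.loads(entry["message"])["message"]
        if not message.get("method", "").startswith("Network."):
            continue
        params = message.get("params", {})
        # Request events carry the URL under "request", response events under "response".
        for key in ("request", "response"):
            url = params.get(key, {}).get("url", "")
            if ".m3u8" in url and not url.startswith("blob:") and url not in urls:
                urls.append(url)
    return urls


def make_entry(method, params):
    """Build a simplified log entry: a dict whose "message" value is a JSON string."""
    return {"message": json.dumps({"message": {"method": method, "params": params}})}


# Hand-built sample data, not captured from a real page
sample_logs = [
    make_entry("Network.requestWillBeSent", {"request": {"url": "https://example.com/v/master.m3u8"}}),
    make_entry("Network.responseReceived", {"response": {"url": "https://example.com/v/master.m3u8"}}),
    make_entry("Network.requestWillBeSent", {"request": {"url": "https://example.com/page.html"}}),
    make_entry("Page.loadEventFired", {"timestamp": 1.0}),
]
print(m3u8_urls_from_logs(sample_logs))
```

Keeping the parsing in its own function makes it easier to check against saved logs before running it on a live page.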
Once you have the stream URL, you can play it using the Open Network Stream option in VLC media player.
The m3u8 stream can also be downloaded as an .mp4 file using FFmpeg. It can be installed on Ubuntu using:
sudo apt install ffmpeg
After installing FFmpeg, we can easily download the video using the command below:
ffmpeg -i http://..m3u8 -c copy -bsf:a aac_adtstoasc output.mp4
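If the download needs to run from the same Python script that found the stream URL, the command above can be wrapped in a subprocess call. This is a minimal sketch assuming ffmpeg is installed and on the PATH; the function names and the example.com URL are placeholders.

```python
import subprocess


def build_ffmpeg_command(m3u8_url, output_path):
    """Build the ffmpeg argument list that copies an HLS stream into an MP4 file."""
    return [
        "ffmpeg",
        "-i", m3u8_url,
        "-c", "copy",                # copy the streams without re-encoding
        "-bsf:a", "aac_adtstoasc",   # same audio bitstream filter as the command above
        output_path,
    ]


def download_stream(m3u8_url, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_ffmpeg_command(m3u8_url, output_path), check=True)


# Placeholder URL; only the command is printed here, nothing is downloaded
print(build_ffmpeg_command("https://example.com/video/master.m3u8", "output.mp4"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the stream URL contains characters such as & or ?.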
We hope you enjoy these two approaches to advanced video scraping. Let us know if you have any questions.