How to Easily Extract All URLs from Your Sitemap Using Python

Hello, SEOs!

It’s been a while since my last post. I’ve been revamping my setup and got myself a comfy new chair, which is perfect for coding technical SEO solutions.

Today, I’ve got a useful script to share with you, perfect for anyone managing a site. Imagine you want to extract all URLs from your sitemap.xml file quickly.

Maybe you’re analysing your site’s content, or perhaps you need to keep track of which pages are active. With Python, you can do this in just a few lines of code.

Here’s the code that makes it happen:

import requests
import xml.etree.ElementTree as ET
import os
from IPython.display import display, HTML

# Base URL of the primary sitemap.xml file
sitemap_url = "https://olehdankevych.com/sitemap_index.xml"

# Directory to save URLs
output_dir = "sitemap-links"
os.makedirs(output_dir, exist_ok=True)

# List to keep track of all extracted URLs
all_urls = []


# XML namespace used by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_sitemap_urls(url):
    """Fetch URLs from a sitemap, including nested sitemaps."""
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        root = ET.fromstring(response.content)

        # Check each <sitemap> tag for nested sitemaps
        for sitemap in root.findall(f".//{NS}sitemap"):
            nested_sitemap_url = sitemap.find(f"{NS}loc").text.strip()
            print(f"Found nested sitemap: {nested_sitemap_url}")
            fetch_sitemap_urls(nested_sitemap_url)

        # Extract page URLs from <url><loc> only, so that the addresses of
        # nested sitemaps are not saved as if they were pages
        for loc in root.findall(f".//{NS}url/{NS}loc"):
            url_text = loc.text.strip()
            print(f"Extracted URL: {url_text}")
            all_urls.append(url_text)

    else:
        print(f"Failed to fetch sitemap at {url}. Status code: {response.status_code}")


# Fetch all URLs starting from the primary sitemap
fetch_sitemap_urls(sitemap_url)

# Save all URLs to a file
file_path = f"{output_dir}/all_urls.txt"
with open(file_path, "w", encoding="utf-8") as file:
    for url in all_urls:
        file.write(url + "\n")

# Display message and download link
print("\nAll URLs have been saved in 'all_urls.txt'. You can download the file below:")
display(HTML(f'<a href="/content/{file_path}" download="all_urls.txt">Download URLs file</a>'))

Why Use Python to Read Sitemap URLs?

For anyone working with websites, a sitemap.xml file helps search engines discover and crawl your pages. Instead of manually going through each URL, this Python script handles the job by reading the sitemap, parsing the XML, and saving all the URLs for easy access.

What This Code Does:

1. Downloads the Sitemap: It pulls your sitemap from the web and checks if it’s accessible.
2. Parses the XML: With Python’s XML parser, it searches for <loc> tags (where the URLs are stored).
3. Extracts and Saves URLs: Finally, it writes all the URLs into a text file, all_urls.txt, inside the sitemap-links folder, for you to access easily.
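
If you want to see the parsing step on its own, here is a minimal sketch that works on XML text instead of a live website, so you can try it without any network access. The helper name parse_sitemap and the example.com sample data are my assumptions for illustration only; they are not part of the script above. It separates the <sitemap> entries of a sitemap index from the <url> entries of a regular sitemap.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (nested_sitemap_urls, page_urls) found in one sitemap document.

    Hypothetical helper for illustration; it only parses, it does not download.
    """
    root = ET.fromstring(xml_text)
    # <sitemap><loc> entries point to further sitemap files
    nested = [loc.text.strip()
              for loc in root.findall(f".//{NS}sitemap/{NS}loc") if loc.text]
    # <url><loc> entries are the actual page addresses
    pages = [loc.text.strip()
             for loc in root.findall(f".//{NS}url/{NS}loc") if loc.text]
    return nested, pages


# Made-up sample data, not taken from a real site
sample_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""

sample_urlset = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

print(parse_sitemap(sample_index))
# (['https://example.com/post-sitemap.xml'], [])
print(parse_sitemap(sample_urlset))
# ([], ['https://example.com/', 'https://example.com/about/'])
```

Keeping the two kinds of entries apart matters because a sitemap index also uses <loc> tags, and you usually don't want the addresses of the sitemap files themselves mixed in with your page URLs.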

Give it a try and let me know how it works for you!

P.S. I will upgrade it and soon it will have more detailed checks.

I have placed the updated code for extracting URLs from sitemap.xml and its nested sitemaps on Google Colab. You can access it and run the code directly from this link: How to Easily Extract All URLs from Your Sitemap Using Python

****
Enjoy this? ♻️ Repost it to your network and follow me for more.
