I have been working on a project where I need to extract RSS feeds from various blogs and news websites. Essentially, I want to pass a URL to my API and have it return the RSS feed associated with that domain.
As with most things, I wasn’t the first person to come across this problem. Aaron Swartz (RIP) wrote a script called feedfinder.py that does exactly this. Its major shortcoming is that it’s fairly dated and written for Python 2. After fighting a losing battle with Python’s 2to3 conversion tool, I realized I’d already spent more time trying to port the old script than it would take me to write a new one.
My Solution: Python 3 function for extracting RSS feeds from URLs
I wanted my function to be accurate and thorough, which (for me) means:
- I wouldn’t miss any legitimate feeds that were on a website and
- I wouldn’t include any links that were not valid RSS feeds.
I’ve copied my solution below, which you should be able to interpret fairly easily. I start by looking for <link> tags pointing to RSS feeds, then parse the page for any <a> tags whose href contains “xml”, “rss”, or “feed”. Finally, I use feedparser to go through the list of possible RSS feeds and validate that each link points to a valid feed.
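As a rough illustration of those three steps, here is a minimal sketch. It is not the full solution: the names FeedCandidateParser, find_candidates, and validate_feeds are made up for this example, it uses the standard library’s html.parser rather than any particular HTML library, and it assumes that a “valid” feed is one that feedparser can parse into at least one entry.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: these MIME types mark a <link> tag as advertising a feed.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
# Substrings from the description above that make an <a href> worth checking.
FEED_HINTS = ("xml", "rss", "feed")


class FeedCandidateParser(HTMLParser):
    """Hypothetical helper: collect URLs on one page that might be feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "link":
            # Step 1: <link> tags whose type advertises a feed.
            is_candidate = (attrs.get("type") or "").lower() in FEED_TYPES
        elif tag == "a":
            # Step 2: <a href> values containing "xml", "rss", or "feed".
            is_candidate = any(hint in href.lower() for hint in FEED_HINTS)
        else:
            is_candidate = False
        # Resolve relative links against the page URL and skip duplicates.
        url = urljoin(self.base_url, href)
        if is_candidate and url not in self.candidates:
            self.candidates.append(url)


def find_candidates(html, base_url):
    """Return possible feed URLs found in an HTML string, in page order."""
    parser = FeedCandidateParser(base_url)
    parser.feed(html)
    return parser.candidates


def validate_feeds(candidates):
    """Step 3: keep only URLs that feedparser parses into at least one entry."""
    import feedparser  # third-party package: pip install feedparser

    return [url for url in candidates if feedparser.parse(url).entries]
```

To use it, download a page’s HTML however you like, pass it to find_candidates along with the page URL, and hand the result to validate_feeds. The substring check in step 2 is deliberately loose and will pick up plenty of non-feed links, which is why the validation pass matters.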