
New RSS Feed

I added an RSS feed for the few site visitors who would like to receive updates when I post here. I’d been putting off building one for years. The specification has many quirks (especially when it comes to dates, which must follow RFC 822), and I thought it wouldn’t really be worth the hassle. My MPhil dissertation relies heavily on RSS, however, so I’ve now become quite the expert!

I’ve noticed a slight resurgence of interest in RSS (on Hacker News mostly), and I recently had the pleasure of learning awk. I used this newfound knowledge and motivation to quickly script an RSS builder. For anyone interested in making their own RSS feed using awk, I’ve included my source code in Listing 1. The script runs over the ten most recent HTML files on my website, and it relies on the site having a flat file structure and incrementing permalinks. I would recommend this link structure to anyone who likes the idea of managing their blog mainly with UNIX commands and pipelines.
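As a rough illustration of why this naming scheme helps: when every post has a zero-padded, incrementing file name, a reverse lexical sort is also a newest-first listing. The sketch below assumes four-digit names such as 0001.html; the newest_posts helper and the demo files are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: list the N newest posts in a flat directory of numbered pages.
# Assumes zero-padded, four-digit names (0001.html, 0002.html, ...).
newest_posts() {
    # $1: directory holding the posts, $2: how many to list
    local dir=$1 count=$2
    # The glob skips pages such as index.html that are not numbered posts
    (cd "$dir" && ls -- [0-9][0-9][0-9][0-9].html | sort -r | head -n "$count")
}

# Demo with throwaway files
demo=$(mktemp -d)
touch "$demo/0001.html" "$demo/0009.html" "$demo/0010.html" "$demo/index.html"
newest_posts "$demo" 2
rm -rf "$demo"
```

Without the zero padding, 10.html would sort before 9.html, so the padding is what makes the plain sort sufficient.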

If you already do a lot of work with grep and sed, you’ll save a lot of time and effort by learning some basic awk scripting. I’ve never seen a language like it when it comes to string and stream processing.
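To make the comparison concrete, here is a small sketch that pulls the text of a one-line title element from a page, first with grep and sed, then with awk alone. The function names and the sample page are hypothetical, and both versions assume the title sits on a line of its own.

```shell
#!/usr/bin/env bash
# Two ways to extract the text of a one-line <title> element.
title_with_grep_sed() {
    # grep selects the line, sed strips the tags
    grep '^<title>' "$1" | sed 's/<[^>]*>//g'
}

title_with_awk() {
    # One process: match the line, strip the tags, print, stop reading
    awk '/^<title>/ { gsub(/<[^>]*>/, ""); print; exit }' "$1"
}

# Demo with a throwaway page
page=$(mktemp)
printf '%s\n' '<html>' '<title>New RSS Feed</title>' '</html>' > "$page"
title_with_grep_sed "$page"
title_with_awk "$page"
rm -f "$page"
```

The awk version keeps the matching and the rewriting in one place, which is what makes it easy to grow into a script that tracks state across lines.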

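The script in Listing 1 will most likely not work on your site out of the box, but you can tweak it for your own purposes, or just read it for ideas. Note that it uses gensub, mktime, and strftime, which are GNU awk (gawk) extensions rather than part of POSIX awk.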

The basic idea is this:

  1. For each HTML file…
  2. search for the title,
  3. search for the creation date,
  4. search for content,
  5. and use the file name
  6. … to generate RSS items.
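The steps above can be sketched in a stripped-down form that handles only the title and the file name. This is a simplified illustration rather than the script in Listing 1: items_for is a hypothetical helper, https://example.com is a placeholder base URL, and it assumes each title sits on a line of its own.

```shell
#!/usr/bin/env bash
# Stripped-down sketch: file name + title only, in plain POSIX awk.
items_for() {
    for file in "$@"; do
        # Marker line announcing which file the following lines belong to
        echo "<<${file##*/}>>"
        cat "$file"
    done | awk '
    /^<<.*>>$/ {
        # Remember the current file name, minus the << >> wrapper
        file = substr($0, 3, length($0) - 4)
        next
    }
    /^<title>[^<]*<\/title>$/ {
        print "<item>"
        print "\t" $0
        # Placeholder base URL, not a real site
        print "\t<link>https://example.com/" file "</link>"
        print "</item>"
    }'
}

# Demo with a throwaway page
demo=$(mktemp -d)
printf '%s\n' '<title>Hello</title>' '<p>Body text</p>' > "$demo/0001.html"
items_for "$demo/0001.html"
rm -rf "$demo"
```

The marker line is the key trick: because all files are concatenated into one stream, awk needs some in-band signal to know which file name belongs to the lines it is reading.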

I don’t usually bother with description tags, so I left that part of the RSS specification out of the process. Maybe one day I’ll add a bot to my server that automatically describes longer posts.

#!/usr/bin/bash
ls *.html\
    | sort -r\
    | head -10\
    | while IFS= read -r file; do
        # Emit a marker line so awk can tell where each file starts
        echo "<<$file>>"
        cat "$file"
    done\
    | awk '
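    # Escape XML special characters. "\\&" yields a literal "&" in the
    # replacement; a bare "&" would stand for the matched text instead.
    function escape(t) {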
        gsub(/&/, "\\&amp;", t);
        gsub(/</, "\\&lt;", t);
        gsub(/>/, "\\&gt;", t);
        return t;
    }
    function clean(t) {
        gsub(/&rsquo;/, "’", t);
        return t;
    }
    BEGIN {
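        # RSS dates follow RFC 822, so use a numeric zone offset (%z);
        # %Z can print abbreviations such as IST that RFC 822 does not allow
        RSS_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"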
        content = ""
        inContent = 0
        attribute = 0
        file = ""
        print  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        printf "<rss"
        printf " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
        printf " xmlns:content=\"http://purl.org/rss/1.0/modules/content/\""
        printf " xmlns:atom=\"http://www.w3.org/2005/Atom\""
        print  " version=\"2.0\">"
        print  "\t<channel>"
        print  "\t\t<title>Seán Healy</title>"
        print  "\t\t<description></description>"
        print  "\t\t<link>https://seanh.sh</link>"
        print  "\t\t<lastBuildDate>"strftime(RSS_DATE_FORMAT)"</lastBuildDate>"
    }
    /^<<[0-9]{4}\.html>>$/ {
        if (attribute == 0) print "\t\t<item>"
        file = gensub(/[<>]/, "", "g", $0)
        url = "https://seanh.sh/"file
        print "\t\t\t<link>"url"</link>"
        print "\t\t\t<guid>"url"</guid>"
        # link and guid count as two attributes
        attribute += 2
    }
    /^<title>[^<]*<\/title>$/ {
        if (attribute == 0) print "\t\t<item>"
        print "\t\t\t"clean($0)
        attribute++
    }
    /^<p class="date">[^<]*<\/p>$/ {
        if (attribute == 0) print "\t\t<item>"
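        # Expects a line such as <p class="date">Created: 2020-06-06 Sat 20:12</p>,
        # so $3 is the date and $5 is the time; mktime wants "YYYY MM DD HH MM SS"
        datetime = gensub(/[^0-9]/, " ", "g", gensub(/<\/p>/, "", "g", $3" "$5))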
        timestamp = mktime(datetime" 00")
        print "\t\t\t<pubDate>"strftime(RSS_DATE_FORMAT, timestamp)"</pubDate>"
        attribute++
    }
    /^<div id="content">$/ {
        inContent = 1
        next
    }
    /^<header id=["'"'"']?header["'"'"']?>$/ {
        if (attribute == 0) print "\t\t<item>"
        inContent = 0
        print "\t\t\t<content:encoded>"
        print escape(content)
        print "\t\t\t</content:encoded>"
        content = ""
        attribute++
    }
    # Five attributes seen (link, guid, title, pubDate, content): close the item
    attribute == 5 {
        print "\t\t</item>"
        attribute = 0
    }
    inContent {
        content = content"\n\t\t\t"$0
    }
    END {
        print "\t</channel>"
        print "</rss>"
    }
    ' > rss.xml
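The escape function is the part most easily reused elsewhere, so here is a small standalone sketch of the same idea in plain awk. The escape_xml name and the sample input are made up for illustration. The ampersand has to be replaced first; otherwise the & in the freshly inserted &lt; and &gt; would be escaped a second time.

```shell
#!/usr/bin/env bash
# Sketch: escape XML special characters on each line of standard input.
escape_xml() {
    awk '{
        # "\\&" is a literal ampersand in the replacement text;
        # a bare "&" would be replaced by the matched text
        gsub(/&/, "\\&amp;")
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        print
    }'
}

printf '%s\n' '<p>Fish & chips</p>' | escape_xml
# prints: &lt;p&gt;Fish &amp; chips&lt;/p&gt;
```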

Author: Seán Healy

Created: 2020-06-06 Sat 20:12
