Adding Search to My Static Blog Using AWS Lambda and SQLite
2021 July 09

I always wanted to have search on my site, but it’s statically hosted, so I can’t have any dynamic content on it. I recently resolved that by creating a little tiny lambda function on AWS that queries a SQLite database hosted on S3. Here I’ll walk you through how I created the back-end for it, and in some later posts I’ll include details on calculating the cost.

The first step is to build the SQLite database using the FTS3 extension. In order to do that I created a simple ruby plugin for jekyll that looped through all the posts and inserted them into an FTS table into a private location that would not be deployed to the site. The FTS table looks something like so:

CREATE VIRTUAL TABLE posts USING 
    fts3(content, url, title, author, tokenize=porter);

We are using FTS3 instead of FTS4, because that’s what’s compiled and enabled on the AWS Lambda container image, and it doesn’t feel like it’s particularly wise to create and maintain an entire image for something that I don’t reall want to think about. You can check if FTS4 is available with PRAGMA compile_options. Also note that the porter tokeniser is used because it provides much better stemming and matching.

So the next step is to actually determine our search infrastructure which is something like so:

API Gateway <-> Lambda Handler <-> S3 Bucket

The API gateway enables a domain and a series of routes which can be used to invoke the lambda function. The S3 bucket is where we upload the SQLite database, and where the lambda function pulls it down from. This setup work is cached by the lambda function since the same task might be invoked multiple times and we don’t have to pay those start up costs every single time the function is called. The next step is to define some cloudformation so that it can be quickly deployed without messing with a GUI. Remember that infrastructure as code is the way of the future. You’ll note that I use JSON over YAML for my cloudformation. I just find that JSON has a stronger formatting requirement and is less ambigious than YAML.

{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Transform": "AWS::Serverless-2016-10-31",
    "Resources": {
        "SearchFunction": {
            "Type": "AWS::Serverless::Function",
            "Properties": {
                "CodeUri": "./search.py",
                "Runtime": "python3.8",
                "Handler": "search.lambda_handler",
                "Environment": {
                    "Variables": {
                        "SearchDatabaseBucket": { "Ref": "SearchDatabaseBucket" }
                    }
                },
                "Policies": [
                    "AmazonS3ReadOnlyAccess"
                ],
                "Timeout": 3,
                "MemorySize": 1024,
                "Events": {
                    "PerformSearch": {
                        "Type": "HttpApi",
                        "Properties": {
                            "Path": "/search",
                            "Method": "get",
                            "ApiId": {"Ref": "HttpApi"}
                        }
                    }
                }
            }
        },
        "HttpApi": {
            "Type": "AWS::Serverless::HttpApi",
            "Properties": {
                "CorsConfiguration": {
                    "AllowOrigins": [
                        "http://localhost:4000",
                        "https://www.henryschmale.org"
                    ]
                }
            }
        },
        "SearchDatabaseBucket": {
            "Description": "The bucket to store the actual generated test artifacts.",
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {
                    "Status": "Enabled"
                }
            }
        }
    }
}

This basically defines our routes and tells cloudformation where to get everything that needs to be setup. You’ll take note of the CorsConfiguration defintion, it says what other origins are allowed to call this API, and actually get the results. You’ll see it contains my website, and where I test it at.

So let’s move on to the actual code within the lambda function. I’m using a python runtime because it’s what I know, the standard library is huge, and boto3 is already availabe in lambda. I’m not going to go too deep in the code, but here’s a pseudo outline:

download_database()

SQL = """
SELECT 
    title
    , url
    , snippet(posts) as snippet
    , round(rank(matchinfo(posts, 'pcx')) * 10, 2) as score
FROM posts
WHERE posts MATCH ?
ORDER BY score ASC
LIMIT 10
"""

def query_database(query):
    with sqlite3.connect("search.db") as db:
        # do database bookkeeping register some custom functions
        return db.execute(SQL, [query])

def lambda_handler(event, context):
    query = event['queryStringParameters']['q']
    items = query_database(query)

    return {
        "body": items
    }    

Basically the lambda runtime invokes the lambda_handler, as requests come in and goes to sleep if there’s nothing to process. Anything outside of the handler gets executed exactly once if it’s at the top level. So at the beginning the SQLite database is downloaded from S3, and it’s potentially kept open for multiple requests. A SQL query is executed using the various FTS functions. For a better introduction on that I recommend this blog post by Charles Leifer.

Finally, a simple search page is built using jekyll, that uses fetch to make a request and render the results to the page. Feel free to check it out at the Search Page, and the source is directly embedded into the page.


UPDATE 2021/07/10 TIL that a database can be packaged up with lambda saving a call to s3, making this search even cheaper to run. Now if this were for a multisite deployment I would probably keep s3 in there and have a database for each site, but I don’t. Just replace the CodeUri in the cloudformation with a path to a zip file containing your code and the database. It would matter more if I had more long form content, but I don’t and the size limit for lambda is very generous (50MB zipped). This is also faster, because there’s fewer network requests made.

*****
Written by Henry J Schmale on 2021 July 09