I’ve always found it fun to explore the connections between things, so I decided to build a simple graph-exploration site, imdb.mistelbacher.ca, where I could spend a couple of hours clicking on actors (and maybe learn a thing or two about development in 2025).
During my last term at Waterloo, I took a database course where the final project was to build a movie database app. One of the advanced features I added was a recursive SQL query that found how a root actor relates to other actors through their shared movies. Something like:
WITH RECURSIVE six_degrees AS (
  SELECT
    p.id AS root_person_id,
    p.id AS connected_person_id,
    NULL AS title_id,
    0 AS degree
  FROM person p
  WHERE p.id = ?
  UNION ALL
  SELECT
    sd.root_person_id,
    c2.person_id AS connected_person_id,
    c2.title_id,
    sd.degree + 1 AS degree
  FROM
    six_degrees sd
    JOIN character c1 ON c1.person_id = sd.connected_person_id
    JOIN character c2 ON c2.title_id = c1.title_id AND c2.person_id != c1.person_id
  WHERE sd.degree < 6
)
SELECT
  sd.degree,
  root.id AS root_person_id,
  root.name AS root_actor_name,
  connected.id AS connected_person_id,
  connected.name AS connected_actor_name,
  t.primary_title
FROM six_degrees sd
JOIN person root ON root.id = sd.root_person_id
JOIN person connected ON connected.id = sd.connected_person_id
LEFT JOIN title t ON t.id = sd.title_id
ORDER BY sd.degree
But SQL databases really struggle with recursive queries like this, especially on large datasets: every level of recursion multiplies the number of rows to join against. We were using the full IMDb non-commercial dataset with over 10 million movies, and my .sqlite file was 10 GB. I knew I wanted to go back and design around this feature from the ground up.
The plan was to use a graph database, specifically the Apache AGE extension for Postgres, to power a graph for the user to explore. I chose NestJS for the backend and React for the frontend, since I use both at work and wanted to get some more reps in. The IMDb non-commercial dataset would power the app.
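Getting AGE going takes a little Postgres ceremony before any Cypher can run: load the extension, fix the search path, and create the graph. Here’s a minimal sketch of that setup using a plain pg client, assuming the graph is named imdb_graph (the setupGraph helper is mine, not part of AGE):

import { Client } from 'pg';

// One-time setup before any Cypher can run against AGE.
async function setupGraph(client: Client): Promise<void> {
  await client.connect();
  // Install and load the extension, and put ag_catalog on the search
  // path so cypher() and agtype resolve without prefixes.
  await client.query(`CREATE EXTENSION IF NOT EXISTS age`);
  await client.query(`LOAD 'age'`);
  await client.query(`SET search_path = ag_catalog, "$user", public`);
  // Create the graph only if it doesn't exist yet.
  const { rows } = await client.query(
    `SELECT count(*) FROM ag_catalog.ag_graph WHERE name = 'imdb_graph'`,
  );
  if (Number(rows[0].count) === 0) {
    await client.query(`SELECT create_graph('imdb_graph')`);
  }
}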
The goal was twofold: to make something fun, but also to master the basics.
The first thing to do was to load data into the DB. Since Apache AGE is built on top of a SQL database, writing queries is a little funky. I made a simple wrapper that takes just the openCypher query and handles formatting and output parsing (and ensures the output is nicely typed).
async runCypher<T>(
  query: string,
  params: Record<string, any> = {},
  columns: string[] = ['result'],
): Promise<T[]> {
  // AGE exposes Cypher through the cypher() SQL function, which takes
  // the graph name, the (dollar-quoted) query, and a parameter map, and
  // requires every output column to be declared as agtype.
  const text = `
    SELECT * FROM cypher(
      '${this.graphName}',
      $$ ${query} $$,
      $1
    ) AS (
      ${columns.map((c) => `${c} agtype`).join(', ')}
    )
  `;
  const values = [params ? JSON.stringify(params) : 'NULL'];
  const res = await this.client.query(text, values);
  return res.rows.map((row: Record<string, any>) => {
    const parsedRow: Record<string, any> = {};
    for (const col of columns) {
      const ag = row[col];
      // Scalars (strings, numbers, booleans) pass through untouched.
      if (typeof ag !== 'object' || ag === null) {
        parsedRow[col] = ag;
        continue;
      }
      // Nodes and edges carry their data under 'properties'; our agtype
      // parser yields Maps, so handle both Maps and plain objects.
      const props = ag instanceof Map ? ag.get('properties') : ag.properties;
      parsedRow[col] =
        props instanceof Map ? this.mapToObject(props) : (props ?? ag);
    }
    return parsedRow as T;
  });
}
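With the wrapper in place, queries read as plain openCypher. A quick illustrative call (the person ID and costar naming are just examples, not app code):

// Illustrative caller: everyone who shared a movie with a given actor.
const rows = await this.runCypher<{ costar: any; m: any }>(
  `MATCH (p:Person {id: $personId})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(costar:Person)
   RETURN costar, m`,
  { personId: 'nm0000206' },
  ['costar', 'm'],
);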
The file import.ts then handles the data load itself: reading the IMDb TSV files and batch-inserting the Person and Movie nodes, plus the ACTED_IN relationships between them.
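Roughly, the batching looks like this (the batch size and helper names are illustrative, not the exact import code):

// Illustrative batching loop; `graph` is the service exposing runCypher above.
const BATCH_SIZE = 1000;

async function insertActedIn(
  graph: { runCypher<T>(q: string, p?: Record<string, any>, c?: string[]): Promise<T[]> },
  rels: { personId: string; movieId: string }[],
): Promise<void> {
  for (let i = 0; i < rels.length; i += BATCH_SIZE) {
    const batch = rels.slice(i, i + BATCH_SIZE);
    // One round trip per batch instead of one per relationship.
    await graph.runCypher(
      `UNWIND $rels AS rel
       MATCH (p:Person {id: rel.personId}), (m:Movie {id: rel.movieId})
       CREATE (p)-[:ACTED_IN]->(m)`,
      { rels: batch },
    );
  }
}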
One big issue was speed, especially for the following Cypher query that inserted the actor relationships:
UNWIND $rels AS rel
MATCH (p:Person {id: rel.personId}), (m:Movie {id: rel.movieId})
CREATE (p)-[:ACTED_IN]->(m)
The MATCH (like a SQL SELECT) was taking forever. I did some research online and found that this could be sped up dramatically with a bit of a hack. Apache AGE has underlying tables with JSONB columns that store the properties of each node. What you can do is make a GIN index on those JSONB columns, like so:
CREATE INDEX person_properties_gin ON imdb_graph."Person" USING GIN (properties);
CREATE INDEX movie_properties_gin ON imdb_graph."Movie" USING GIN (properties);
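Since these are ordinary Postgres indexes on AGE’s backing tables, the import script just has to create them at the right moment: after the node load (so node inserts stay fast) and before the MATCH-heavy relationship load. Something like:

// Run after the node load, before the relationship load.
await client.query(`
  CREATE INDEX IF NOT EXISTS person_properties_gin
    ON imdb_graph."Person" USING GIN (properties)
`);
await client.query(`
  CREATE INDEX IF NOT EXISTS movie_properties_gin
    ON imdb_graph."Movie" USING GIN (properties)
`);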
The whole data load process can then be kicked off from inside the container with npm run import, and completes in under an hour.
I also wanted to touch on some of the things I did to improve the graph UX:
New nodes are placed randomly in a circle around the node that was clicked. I tried having nodes “stream” in, but I found it more annoying than having them all appear at once
If there are multiple edges between the same pair of actors, they are grouped together and the width of the existing edge is increased
Node size scales logarithmically with its degree
Nodes (actors) are grouped together by their common movies. This one is a bit tricky, but basically if two nodes are in the same movie, we create invisible “pseudo” edges between them so that they are attracted to each other. This results in the nice grouping by movie around an actor (otherwise, the layout would be random); a sketch follows below
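Here’s a minimal sketch of the pseudo-edge and node-sizing tricks, assuming a force-directed layout where links pull their endpoints together (the data shapes and constants are illustrative):

interface GraphLink {
  source: string;
  target: string;
  hidden?: boolean; // pseudo edges are simulated but never rendered
}

// For every movie, connect each pair of its actors with an invisible
// link so the force simulation pulls co-stars into per-movie clusters.
function addPseudoEdges(
  movieToActors: Map<string, string[]>,
  links: GraphLink[],
): void {
  for (const actors of movieToActors.values()) {
    for (let i = 0; i < actors.length; i++) {
      for (let j = i + 1; j < actors.length; j++) {
        links.push({ source: actors[i], target: actors[j], hidden: true });
      }
    }
  }
}

// Node radius grows with the log of its degree (constants are illustrative).
const nodeRadius = (degree: number): number => 6 + 4 * Math.log2(degree + 1);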
This isn’t a huge surprise, but LLMs work best when you already know what you want to do. For example, they really did speed up my FE + BE development, since I knew exactly what I wanted and could tell at a glance whether they were giving me good advice or BS.
On the other hand, I struggled with the production deployment. Even though it really isn’t difficult to set up NGINX, I just didn’t know how to set up a VM in general, and the LLM (I used ChatGPT, mainly) often led me down the wrong paths. After a bit of research I realized all I needed was to set up NGINX + certbot directly on my VM, with a simple config that proxies to either my backend (/api/) or my frontend.
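For reference, the shape of that config is roughly the following (the port and static-file path are illustrative; certbot layers the TLS config on top):

server {
    listen 80;
    server_name imdb.mistelbacher.ca;

    # API traffic goes to the NestJS backend (port is illustrative).
    location /api/ {
        proxy_pass http://localhost:3000;
    }

    # Everything else serves the built React app.
    location / {
        root /var/www/imdb;          # illustrative path
        try_files $uri /index.html;  # SPA fallback
    }
}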