Mediafai – Luke Beales

There was a need for a product that could cover a gap in the market, which was to provide configuration-free and extremely cost-effective custom video generation rapidly for franchises, with extremely low running costs and minimal/no staff. To make it viable this needed to be truly industry agnostic, useable for all sizes of businesses, and with the technology available via license and APIs to be an engine within other platforms. Let’s take a look at the benefits and concerns for the business just like with the case studies.

Beneficial	Concerning
History of commercially successful platforms, including a previous video generator	Minimal direct sales experience
Existing validated traction & PMF	Limited capital
Not a first time founder	Increasing competition from AI, agencies, big players
Very low development costs
Exposure to customer requirements, needs, wants
Indirectly known in the startup & investment scene

How to achieve a SaaS quicker and for lower costs

There are a number of ways you can dramatically reduce development costs without impacting the product initially. This gets the first stage in front of customers quicker, to gain traction and provide a solid base and pathway to further stages. The general rules were:

No currencies. The platform would not handle any money, currency conversions, or payments directly. This would all be passed off to a third party as they have the scaled infrastructure and security already. It would also be a monthly subscription service – not a pay-per-service. This further removes the platform from needing to know anything about currencies or balances – all it needs to know is whether a subscription is paid/valid or not.
No translations (but maybe later). Translations can be handled a few ways, either with a third party translating the page, or more controlled by having consistent labelling and storing the translations in the product itself. The third party method is the easiest, and can be added on later. In-product translations can happen further down the track if a customer requires more accurate results.
Simplified infrastructure. There won’t be millions of customers overnight so this will be built as a monolithic application (not serverless, no IaC) – one development chain, one product. As there are only a few developers, this removes a lot of overhead and allows rapid prototyping. It can have faster hardware thrown at it (vertical scaling) as the customer base grows, with horizontal scaling coming later to keep costs low.
Simplified interface. Using a component library (CSS framework) for the interface allows for a friendly user interface for the customer with a consistent look and feel, without the need for expensive staff at this stage. A huge head start.
Shared components. Utilising shared components & functions allows for tremendous time savings. The downside is the same as any shortcut – it’s a cost later on (technical debt). However the aim is to get a product out there fast to solve customers’ problems. Once the customers are onboard this can be addressed – and often the product will be at a pivot point anyway.
3rd party authentication. Shifting authentication off to a 3rd party takes all of the auth trickery and functions out of the product, leaving us with just a flag saying whether the user is authenticated or not, and optionally a permissions level. It is much simpler, moves the encryption and security of logins & passwords away, and also takes the “Sign in with Google” style auth workload off our plate later. HIGHLY recommend this approach.
Boring, time-tested technologies. In the IT world there are rapid cycles & trends that come and go within years. It’s happening currently with AI/AGI, it happened before with cryptocurrency, the same thing happens with programming languages, frameworks, and infrastructure. It’s easy to get caught up in these waves, but we’ll be sticking to tried and tested technologies to avoid any unknowns from the trends.
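To illustrate how little state these rules leave inside the product, here is a minimal sketch in Python. It assumes a third-party auth provider and a third-party payment provider each report a simple status; the names `AccessState` and `can_generate_video`, and the permission scale, are hypothetical and not taken from the actual platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessState:
    """Per-user state kept by the platform; everything else lives with third parties."""
    authenticated: bool       # reported by the (hypothetical) auth provider
    permission_level: int     # assumed scale: 0 = viewer, 1 = editor, 2 = admin
    subscription_valid: bool  # reported by the (hypothetical) payment provider


def can_generate_video(state: AccessState, required_level: int = 1) -> bool:
    """Allow generation only for signed-in, paid-up users with enough permission."""
    return (
        state.authenticated
        and state.subscription_valid
        and state.permission_level >= required_level
    )
```

No passwords, balances, or currencies appear anywhere in this state, which is the point of handing those concerns to a third party.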

With that decided, there’s a key make or break feature to this – the video generation and capturing. All SaaS platforms are basically an interface for a user to change or manipulate data, but they usually contain a few calculation/processing components underneath which make them unique. For this product, it has the general user interface where the customer can adjust parameters and monitor statuses like every other SaaS platform, but it has a key processing-heavy component of video generation and capture.

If a custom video can’t be achieved extremely cost-effectively and rapidly this product won’t be sustainable, or any different to the numerous other platforms out there that are basic video generation tools. Having built a previous video generation platform I was aware of the high costs and bottlenecks that plagued it, however it was horizontally scalable which is a common technique that will also be used for this product.

Horizontally scaling video generation

Generating a video is fairly straightforward. There are two main steps:

Generate each layer – capturing them frame by frame. This can happen in any order.
Combine all the layers together frame by frame resulting in a video. Importantly – this can ONLY happen once all the layers are completed.

The simplified yet bottlenecked method of generating a video

That’s an extremely simplified version of it, but the key part is that each layer needs to be captured individually first. Horizontally scaling this is simple – just send each layer off to a different server. Unfortunately, combining the layers could only be done on a single server, which also had to wait for all the others to finish first.
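The two-step flow can be sketched roughly as follows. This is an illustration only, not the original engine: strings stand in for rendered frames, a thread pool stands in for separate servers, and the function names and `FRAMES` constant are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

FRAMES = 3  # tiny frame count, for illustration only


def render_layer(layer_name: str) -> list[str]:
    """Step 1: capture one layer frame by frame (stand-in for real rendering)."""
    return [f"{layer_name}@{frame}" for frame in range(FRAMES)]


def combine_layers(layers: list[list[str]]) -> list[str]:
    """Step 2: merge all layers for each frame; needs every layer finished."""
    return ["+".join(frame_parts) for frame_parts in zip(*layers)]


def generate_video(layer_names: list[str]) -> list[str]:
    # Layers are independent, so each one can go to its own worker.
    with ThreadPoolExecutor(max_workers=len(layer_names)) as pool:
        layers = list(pool.map(render_layer, layer_names))  # waits for all layers
    # The combine step runs in one place, after the slowest layer finishes.
    return combine_layers(layers)
```

Calling `generate_video(["background", "logo"])` returns one combined entry per frame. However many workers handle step 1, step 2 is still a single job that starts last, which is the bottleneck described above.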

The high costs of the previous video engine were a result of how many servers it took to keep the rendering times to a reasonable level, but an interesting point was that it could only reach a minimum rendering time of around 6 minutes (for a ~30 second video) due to the 2nd (combining) step. This timing held only if a single video was being generated in that time; if 10 videos were generated simultaneously, a user could be waiting an hour before receiving a result. While the servers were arranged to be extremely cost effective (headless, shared storage, cached, low tier), it’s not a solution for this product. We need a more efficient approach that uses fewer servers, and an old animated movie holds one of the answers.

How does the 1995 film Toy Story fit in?

I have always enjoyed the art of 3D modelling & rendering, the tools used, and the ways the images and videos are created. When Toy Story burst onto the scene I was fascinated, and I became aware of the tool Pixar used to create the movie, known as RenderMan. This application was commercial and so out of my hobbyist reach, however there was a free alternative called BMRT (Blue Moon Rendering Tools) giving everyone the ability to generate more realistic renders. But how did these tools generate a complex 3D movie with the relatively minimal computing power available at the time?

One technique they used was to create a render farm – this is their version of the horizontal scaling mentioned above – where they linked multiple rendering servers to work together to generate the videos. But instead of sending a whole layer to each server (as we were doing before), all of their servers would work on the one frame at the same time.

This is achieved by having each server render a portion of the frame. For example, if there were only two servers within the render farm, both working on the same frame, server 1 would generate the top half, server 2 would generate the bottom half. The more servers there are, the more each frame is divided up. It can be divided up even further than the number of servers, but this is the basic idea of it.
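A rough sketch of that division, assuming the frame is cut into horizontal bands of rows (one band per server). The function name `split_into_bands` is hypothetical; real render farms often use smaller tiles or "buckets" rather than full-width bands.

```python
def split_into_bands(frame_height: int, servers: int) -> list[tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, one band per server."""
    if servers < 1:
        raise ValueError("need at least one server")
    base, extra = divmod(frame_height, servers)
    bands = []
    start = 0
    for index in range(servers):
        # The first `extra` bands take one additional row so no row is lost.
        height = base + (1 if index < extra else 0)
        bands.append((start, start + height))
        start += height
    return bands
```

With two servers and a 1080-row frame, `split_into_bands(1080, 2)` gives `[(0, 540), (540, 1080)]`: server 1 renders the top half and server 2 the bottom half.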

Applying this to a SaaS website

To accommodate the frame splitting method in our process, we change the earlier 2 steps to 3.

Generate an area or region of a layer
Combine all layers together frame by frame for that area or region
Combine all areas or regions together

How a render farm distributes work to each node, then combines them all together.

You can see it’s a fair bit more complicated than the earlier method, however the benefits are that the jobs (generating each area or region) are smaller, faster, and require less memory. We can use much cheaper hardware instances to perform these jobs, and scale the number of instances to speed it up. It greatly reduces the monolithic final step that stopped the previous video generator from getting any quicker, and it removes the need to implement any special ordering algorithms. A side benefit is that it produces a more accurate progress report of when the video will be ready, so the user isn’t left in the dark.
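The three steps can be sketched for a single frame as below. This is a simplified illustration under stated assumptions, not the product's implementation: a layer is a small grid of pixels where `None` means transparent, regions are bands of rows, the loop stands in for jobs spread across nodes, and all names are hypothetical.

```python
from typing import Optional

Pixel = Optional[str]      # None means transparent
Grid = list[list[Pixel]]   # rows of pixels for one frame


def render_region(layer: Grid, rows: range) -> Grid:
    """Step 1: produce only the requested rows of one layer."""
    return [list(layer[r]) for r in rows]


def combine_layers(regions: list[Grid]) -> Grid:
    """Step 2: stack the layers for one region; later layers sit on top."""
    combined = [list(row) for row in regions[0]]
    for region in regions[1:]:
        for y, row in enumerate(region):
            for x, pixel in enumerate(row):
                if pixel is not None:
                    combined[y][x] = pixel
    return combined


def render_frame(layers: list[Grid], bands: list[range]) -> tuple[Grid, list[float]]:
    """Step 3: stitch the finished regions top to bottom, tracking progress."""
    frame: Grid = []
    progress: list[float] = []
    for done, rows in enumerate(bands, start=1):
        # In a render farm each band would be a separate job on a separate node.
        region = combine_layers([render_region(layer, rows) for layer in layers])
        frame.extend(region)
        progress.append(done / len(bands))
    return frame, progress
```

Because each finished region is a known fraction of the whole, the progress figure falls out of the job count directly, and the final stitch is just a concatenation of already-combined regions.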

After a fair few trial and error stages with manually run scripts, a proof of concept was created showing this process was possible and performed well enough to make the product achievable. It was time to make it into something a trial customer could use without them getting in too much trouble, and with them requiring only a little bit of assistance.

Cost effective, scale-ready infrastructure

We have our video engine method worked out – we need to get it scale-ready as cost-effectively as possible to show it to the world & get feedback. The way to do this (in the non-serverless world) is to create a load balancer. It’s a bit of a cost but worth it, as it gives us two speedy shortcuts:

A single Security Certificate (SSL) entry point that we don’t have to maintain – greatly simplifying maintenance for multiple server arrangements.
The automatic ability to distribute connections (both internal and external) around the instances or servers that sit underneath – usually in what’s called a “round-robin” fashion.

What this means is a load balancer keeps a simple counter that cycles through the servers underneath it. When a user visits the SaaS website, it will send them to the server matching the current counter, and then add one to its counter ready for the next user. The first user will go to server A, the second will go to server B, and so on, wrapping back to server A after the last one. This distributes (or balances) the load meaning we can use far fewer resources per server – which is lower cost than one large server – and more servers can be added when more traffic arrives making it infinitely* scalable. This also applies to our video rendering method from above by creating a render farm, so these servers have a dual purpose – to serve users coming in from online, and to render regions of each video.
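The counter behaviour described here can be sketched in a few lines. This is a toy model for illustration only; a managed load balancer also handles health checks, TLS termination, and more. The class and method names are hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin selector: hands out servers in a repeating order."""

    def __init__(self, servers: list[str]) -> None:
        if not servers:
            raise ValueError("need at least one server")
        self._servers = list(servers)
        self._counter = 0  # index of the server that takes the next request

    def next_server(self) -> str:
        server = self._servers[self._counter]
        # Advance the counter, wrapping back to the first server at the end.
        self._counter = (self._counter + 1) % len(self._servers)
        return server

    def add_server(self, server: str) -> None:
        """Scale out: the new server simply joins the rotation."""
        self._servers.append(server)
```

With servers `["A", "B"]`, three consecutive calls to `next_server()` return `"A"`, `"B"`, then `"A"` again. The same rotation could hand out render regions instead of web visitors, which is the dual purpose mentioned above.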

The moment of truth

There was a fair bit of development up to this stage to make it easy enough to create the individual layers and give them media, appearance parameters, and timing so the video engine would know what to do with them. With all the layers arranged in a 30s video pushing the engine to its limits, it was time to generate a video to let everyone know what was coming.

A screenshot of the original “accelerate your creatives” mediafai video layers. This is generations beyond the very basic list interface at the time.

The resulting video from the layers shown in the screenshot.

I recall this video was generated in under 10 minutes, despite it being extremely effects heavy. The best part was this was achieved with just two low cost servers – significantly reducing both time and costs from the previous video generation tool (60+ servers). This proved complex custom videos could be generated extremely cost-effectively from a web UI – keeping in mind most elements from this video can be pulled in dynamically (e.g. the AI voiceover) and customised to the user/brand at the time of generation. This will be important for the automation goal later on.

With the concept proven and a platform now mostly in place, it was time to make it user-friendly & launch it.

Thank you for reading, please check back soon for the next part of this series.