Processing (large) user-uploaded files in a Node web API

In this post I describe how to organize the processing of user-uploaded files in a Node API with scalability and a good user experience in mind.

Problem

Let’s say your app allows users to import data from an Excel spreadsheet. A simple approach to this task looks like this:

1. The user uploads a file to the /import endpoint.
2. The backend processes the file immediately.
3. The backend returns the result.

This approach works fine as long as files are small and processing does not involve heavy calculations or database and network requests. In practice, though, either someone wants to import a big file (which makes sense, since a small amount of data can be entered via a simple form) or you need to query additional data from somewhere in order to parse it. The import endpoint then takes too long to answer, sometimes longer than the connection timeout.

This is bad for a few reasons:

Each network connection eats up the server’s resources, even if the connection is idle and the client is just waiting for an answer
Because Node runs your JavaScript on a single thread, an endpoint that performs heavy calculations may slow down your API for other clients
If the connection times out, it’s a bad experience for the user, who just wasted a couple of minutes waiting and needs to start over
A fatal error caused by bad data (which might happen, given that you are not in control of file contents) takes down the whole API instance
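
The single-thread point can be seen in a small experiment. This is only an illustrative sketch: heavyParse is a hypothetical stand-in for synchronous file parsing, and measureTimerDelay is a hypothetical helper that reports how late a timer fires while that work is running.

```typescript
// Busy-waits on the main thread, standing in for heavy synchronous parsing.
function heavyParse(durationMs: number): void {
  const end = Date.now() + durationMs;
  while (Date.now() < end) {
    // spin: nothing else can run on this thread in the meantime
  }
}

// Schedules a 0 ms timer, then runs heavy work right away.
// Resolves with how long the timer callback actually had to wait.
function measureTimerDelay(workMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);
    heavyParse(workMs);
  });
}
```

The timer here plays the role of another client’s request: it cannot be served until the heavy work lets go of the thread, so the measured delay is at least as long as the work itself.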

Solution

The solution to this problem is to use a queue manager (also known as a message broker). You need to change your import flow to the following:

1. The user uploads a file, sending it and any additional data to the /import endpoint.
2. The backend saves the file into temporary storage (e.g. a folder on the server) without processing it.
3. The backend creates a “job” entry in the database, which contains the status of the task (pending, running, success, error), the response text and any additional info you might need (for example, the user id, file name, etc).
4. The backend puts a message in the queue using a queue manager (RabbitMQ, for example). The message should include the job id, the URI of the temporarily stored file and any additional info that is needed for the import.
5. The backend responds with the job id. Steps 2 to 5 should not take a lot of time to finish. Still, a reasonable timeout and file size limit should be configured on your server.
6. The client app informs the user that the import has started and might take a while, and starts to poll the /jobs/:id endpoint periodically (don’t do it too often; once every 15 seconds is enough). You may choose to display a loading indicator (spinner) and tell the user to wait right there, or allow them to do other things.
7. The /jobs/:id endpoint only reads the job from the database, so it’s fast.
8. Meanwhile, a worker waits for messages from the queue manager. The worker is a separate Node program (it has its own entry point) that should be dedicated to only one kind of job: processing data from user-uploaded files. I also recommend configuring it to process only one file at a time.
9. When the worker receives a message, it changes the job’s status to “running” and starts to process the file.
10. When the file is processed successfully, the worker changes the job’s status to “success”. If an error occurs, the worker changes the job’s status to “error” and also updates the error text. You must design the worker in such a way that it never crashes without updating the job’s status in the database. So use try/catch and, in case of an error, update the job and restart the worker.
11. When the client app notices that the job’s status has changed to success or error, it stops polling and reports back to the user. If there was a spinner, remove it and navigate to a screen with the import results; otherwise send an in-app or browser notification.
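
The server side of this flow can be sketched roughly as follows. This is a minimal sketch under loud assumptions, not a definitive implementation: the jobs map stands in for a database table, the queue array stands in for a broker such as RabbitMQ, the counter-based ids stand in for database-generated ids, and startImport, getJob, processNext and parseFile are hypothetical names. File parsing is kept synchronous for brevity; real processing would most likely be asynchronous.

```typescript
type JobStatus = "pending" | "running" | "success" | "error";

interface Job {
  id: string;
  status: JobStatus;
  responseText: string;
  userId: string;
  fileName: string;
}

interface ImportMessage {
  jobId: string;
  fileUri: string;
}

// In-memory stand-ins (assumptions): a database table and a message queue.
const jobs = new Map<string, Job>();
const queue: ImportMessage[] = [];
let nextJobId = 1;

// API side: the file is already saved at fileUri. Create the job,
// put a message in the queue and hand the job id back to the client.
function startImport(userId: string, fileName: string, fileUri: string): string {
  const job: Job = {
    id: String(nextJobId++),
    status: "pending",
    responseText: "",
    userId,
    fileName,
  };
  jobs.set(job.id, job);
  queue.push({ jobId: job.id, fileUri });
  return job.id;
}

// API side: the job endpoint only reads the stored record.
function getJob(id: string): Job | undefined {
  return jobs.get(id);
}

// Worker side: take one message at a time and always record an outcome.
// Returns false when the queue is empty.
function processNext(parseFile: (fileUri: string) => string): boolean {
  const message = queue.shift();
  if (!message) return false;
  const job = jobs.get(message.jobId);
  if (!job) return true; // unknown job id: nothing to update
  job.status = "running";
  try {
    job.responseText = parseFile(message.fileUri);
    job.status = "success";
  } catch (err) {
    // Never leave the job without a final status.
    job.status = "error";
    job.responseText = err instanceof Error ? err.message : String(err);
  }
  return true;
}
```

The important part is the try/catch in processNext: whatever the file contains, the job ends up in either “success” or “error”, so the client never polls a job that is stuck in “running” forever.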

This looks quite complex, but it is not as hard to implement as it seems, and the result is worth it.
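
The client side is the simplest piece. Here is a rough sketch of the polling loop, assuming a hypothetical fetchJob callback that wraps a GET request to /jobs/:id in a real client; pollJob, JobView and the maxAttempts limit are my own illustrative additions.

```typescript
type JobStatus = "pending" | "running" | "success" | "error";

interface JobView {
  status: JobStatus;
  responseText: string;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the job reaches a final status (success or error).
// Gives up after maxAttempts polls so the client never waits forever.
async function pollJob(
  fetchJob: () => Promise<JobView>,
  intervalMs = 15000,
  maxAttempts = 40,
): Promise<JobView> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status === "success" || job.status === "error") return job;
    await sleep(intervalMs);
  }
  throw new Error("Job did not finish within the polling window");
}
```

The caller removes the spinner or sends a notification once pollJob resolves, and can show a “still working” message if it rejects.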

How to further improve things

That basic process is enough for a small or mid-sized service. If you have many users, you might consider doing the following:

Use an in-memory database, for example Redis, to store job statuses. You may use both (the database for permanent storage and Redis to answer the client faster when it requests the status)
Increase the number of worker processes. Initially I recommend having two, which lets you process 2 files in parallel, and if one worker process is down, there is another one. Use pm2 to auto-restart them in case of fatal errors. You can even deploy workers on different virtual machine(s) than your API
Introduce dedicated queues and workers for large and for small files. For this, the import endpoint should read the file size and send the message to the appropriate queue, and workers should be configured to process files from different queues. This way users who import relatively small files are not blocked by those who upload large ones. It might be the case that more users upload smaller files, so you may run more worker processes for them
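
Two of these improvements, the status cache and the size-based queues, can be sketched together. Everything named here is an assumption for illustration: the cache map plays the role of Redis, the database map plays permanent storage, and the queue names and the 5 MB threshold are made up and should be tuned to your own traffic.

```typescript
type JobStatus = "pending" | "running" | "success" | "error";
type QueueName = "import.small" | "import.large";

// Hypothetical threshold separating "small" from "large" uploads.
const LARGE_FILE_BYTES = 5 * 1024 * 1024;

// The import endpoint picks a queue from the uploaded file's size.
function pickQueue(fileSizeBytes: number): QueueName {
  return fileSizeBytes >= LARGE_FILE_BYTES ? "import.large" : "import.small";
}

// Stand-ins (assumptions): permanent storage and a Redis-like cache.
const database = new Map<string, JobStatus>();
const cache = new Map<string, JobStatus>();
let databaseReads = 0;

// The worker writes every status change to both stores.
function saveStatus(jobId: string, status: JobStatus): void {
  database.set(jobId, status);
  cache.set(jobId, status);
}

// The job endpoint answers from the cache and falls back
// to the database only on a cache miss.
function readStatus(jobId: string): JobStatus | undefined {
  const cached = cache.get(jobId);
  if (cached !== undefined) return cached;
  databaseReads++;
  const stored = database.get(jobId);
  if (stored !== undefined) cache.set(jobId, stored);
  return stored;
}
```

With this split, frequent polling mostly hits the cache, and workers subscribed to the small-file queue keep going while a large file is being processed elsewhere.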

The described approach can be used not only for file uploads but also for other tasks that might take an unpredictably long time, for example if your service needs to support exporting user data.

Featured image source: pxhere.com. For those who wonder, “Why a rabbit?”, it’s because a popular queue manager is called RabbitMQ. Choosing a featured image is very hard.