Scalability Challenge in Karobar Easy
January 31, 2022 by Lavish Swarnkar
Namaskaram,
The recent growth in Karobar Easy users made the attendanceUpdater Cloud Function fail! This failure caused inconsistency in the database several times. It took 2 rewrites of the entire cloud function to finally fix the issue. This has been a challenging yet interesting journey for me. So, here I am sharing the incident with you all.
The attendanceUpdater Cloud Function (ClFn) runs at midnight every day. It has the following responsibilities:
- Creating new cycles for staff whose month ended the day before
- Marking attendance for WPO (Weekly Paid Off) & PH (Paid Holiday) staff at 3 nodes in the database:
  - currentAttendance
  - monthAttendance
  - cycle
First Appearance
When I first wrote the cloud function during V1 development (March 2021), I tested it thoroughly to make sure everything worked. And it did work perfectly fine until Nov-Dec 2021. That was when the ClFn started failing. Sadly, no organization noticed the issue, or at least none reported it. Animesh bhaiya was the first to finally report it last week, saying “WPO nahi lag rahi hai!” (“WPO is not getting marked!”).
No errors in log!
I checked the logs immediately but found no error reported there. I waited to see the bug myself the next day. So, I marked WPO for my organization’s staff and waited for the ClFn to mark the attendance. To my surprise, it was marked successfully!
It was Weekend!
I noticed that the ClFn failed only on weekends; on the other days it worked as expected. Why? That’s when the majority of the staff have WPO. I still had no clue what the bug was or why the ClFn was failing to mark attendance for a large number of staff.
The Christmas night
There was no error in the ClFn logs for 2022. But I kept digging through the logs and finally found just one error, reported on 25th Dec, 2021. It was “Error 4 : Deadline exceeded”. That was a sufficient hint for me: the ClFn was timing out. Every ClFn has a default timeout of 60 seconds, i.e. it has only 60s to complete its execution. After that it is terminated. And because of this forced termination, it was unable to even report the error in the logs.
Increase the timeout!
I increased the timeout to its maximum, 9 minutes (540 seconds), and waited for the next Sunday. The day came, but the ClFn still failed to mark attendance, even after executing for 7 mins 47 secs 78 ms (467,078 ms). Thank God, this time it logged the same error! Still, I was not satisfied with the “increase the timeout” solution, because marking WPO for just 173 staff took ~8 mins and the maximum timeout for a Firebase ClFn is 9 mins. How is this scalable?
The first rewrite
I read the ClFn’s code carefully & found that it used sequential execution & Firebase Write Batches. Sequential execution means it was marking attendance for each staff member one after another, not in parallel, which caused the long execution time. The code was something like:
for each staff,
  add attendance in
  - CurrAttendance
  - MonthAttendance
  - Cycle
  (Wait for completion, then proceed to the next!)
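The sequential pattern can be sketched in Node.js roughly as follows. This is only an illustration, not the actual ClFn code: `fakeWrite` and `markSequentially` are hypothetical names, and a timer stands in for each database write.

```javascript
// Hypothetical stand-in for one database write: resolves after `ms` milliseconds.
const fakeWrite = (label, ms = 20) =>
  new Promise((resolve) => setTimeout(() => resolve(label), ms));

// Marks attendance one staff member at a time; every write waits for the previous one.
async function markSequentially(staffIds) {
  const done = [];
  for (const id of staffIds) {
    done.push(await fakeWrite(`currAttendance:${id}`));
    done.push(await fakeWrite(`monthAttendance:${id}`));
    done.push(await fakeWrite(`cycle:${id}`));
  }
  return done;
}

(async () => {
  const started = Date.now();
  const done = await markSequentially(['s1', 's2', 's3']);
  // 3 staff x 3 writes x ~20 ms each: the total grows linearly with the staff count.
  console.log(done.length, 'writes in', Date.now() - started, 'ms');
})();
```

Because every `await` blocks the loop, doubling the number of staff roughly doubles the run time, which matches the behaviour described above.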
Parallel execution was not possible because each organization has a single currentAttendance doc and, for each staff member, attendance has to be added there. With parallel execution, there is a high possibility of concurrent writes to a single document.
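One way concurrent writes to a single document can go wrong is a lost update. The post does not say which failure mode was the concern, so the sketch below is an assumption: it uses an in-memory object instead of a real currentAttendance doc, and `addAttendance`, `runParallel` and `runSequential` are hypothetical names.

```javascript
// Simulated network latency (hypothetical helper).
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Read-modify-write: read the list, add one staff member, write the whole list back.
async function addAttendance(store, staffId) {
  const snapshot = [...store.marked];    // read
  await delay(10);                       // network round trip
  store.marked = [...snapshot, staffId]; // write back, overwriting what is there now
}

// All writers read the same empty list, so each write wipes out the others.
async function runParallel(staffIds) {
  const store = { marked: [] };
  await Promise.all(staffIds.map((id) => addAttendance(store, id)));
  return store.marked;
}

// One writer at a time: every write sees the result of the previous one.
async function runSequential(staffIds) {
  const store = { marked: [] };
  for (const id of staffIds) {
    await addAttendance(store, id);
  }
  return store.marked;
}

(async () => {
  const ids = ['s1', 's2', 's3'];
  console.log('parallel:', (await runParallel(ids)).length, 'entry kept');
  console.log('sequential:', (await runSequential(ids)).length, 'entries kept');
})();
```

In this toy model the parallel run keeps only one of the three entries, while the sequential run keeps all of them.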
As a solution, I rewrote the code like this:
- add attendance of all staff of an organization at once in the CurrAttendance doc
- Run in parallel:
  - For each staff, add attendance in
    - MonthAttendance
    - Cycle
I waited for one more Sunday to see the improvement in execution time. But the problem was not yet solved. It failed once again, and this time no error was thrown! CurrAttendance was marked successfully, but the MonthAttendance & cycle of the staff were not updated, which made the data inconsistent. Thanks to Mohit, who had written a script to mark attendance & make the data consistent again. But the ClFn problem itself was still there.
Acknowledgement problem
The ClFn seemed to simply ignore the parallel code that marked attendance. After much brainstorming & searching, I found that a ClFn has to acknowledge when its execution is complete, and that it is terminated after the last acknowledgement. The last acknowledgement in my ClFn came from creating the CurrAttendance docs; the parallel attendance-marking code never returned one. How could it? How would I know which staff member is the last one to finish? They are all network requests, and any of them could be the last. So the ClFn was being terminated right after the CurrAttendance docs were created, while the per-staff writes were still in flight.
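The pitfall can be reproduced outside Firebase with plain promises. This is a sketch under assumptions, not the original code: `firstRewrite` is a hypothetical name and timers stand in for the network requests.

```javascript
// Simulated network latency (hypothetical helper).
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Mimics the first rewrite: only the CurrAttendance step is awaited.
async function firstRewrite(staffIds, log) {
  await delay(5); // create the CurrAttendance docs
  log.push('currAttendance');
  staffIds.forEach((id) => {
    // These promises are started but never awaited or returned.
    delay(20).then(() => log.push(`month+cycle:${id}`));
  });
  // The function's promise resolves here, with the per-staff writes still pending.
}

(async () => {
  const log = [];
  await firstRewrite(['s1', 's2'], log);
  // At this point only the awaited step has finished.
  console.log(log);
})();
```

In a local Node.js process the pending timers still fire a moment later. In a Cloud Function, the runtime may stop the instance once the returned promise has settled, so writes that are still pending may never happen.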
I looked at all sorts of solutions (including RxJS) for returning this acknowledgement, i.e. waiting for every staff member’s attendance to be marked & finally telling Firebase “hey, I’m done - now you may terminate”. Finally I found the solution: Promise.all(), a very powerful JavaScript function. It combines multiple promises into one that resolves only when all of them have completed (and rejects as soon as any one of them fails). I quickly rewrote the entire code using Promise.all().
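A minimal sketch of the Promise.all() approach, again with timers in place of real database writes. `secondRewrite` and `markMonthAndCycle` are hypothetical names, not taken from the actual ClFn.

```javascript
// Simulated network latency (hypothetical helper).
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for the per-staff MonthAttendance + Cycle writes.
async function markMonthAndCycle(id, log) {
  await delay(20);
  log.push(`month+cycle:${id}`);
}

async function secondRewrite(staffIds, log) {
  await delay(5); // create the CurrAttendance docs
  log.push('currAttendance');
  // Start all per-staff writes at once and wait until every one has finished.
  await Promise.all(staffIds.map((id) => markMonthAndCycle(id, log)));
}

(async () => {
  const log = [];
  const started = Date.now();
  await secondRewrite(['s1', 's2', 's3'], log);
  // All entries are present once the function's promise resolves.
  console.log(log.length, 'steps in', Date.now() - started, 'ms');
})();
```

The per-staff writes overlap instead of queueing up, and the function only reports completion after the slowest one is done.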
Can’t wait!
I couldn’t wait for another Sunday to see if my solution worked. I had taken a backup of the entire data (7.7k docs) on Saturday night so that I could restore it in case the ClFn made the data inconsistent again. I quickly created a dummy project, imported the data, and this time tested the ClFn with actual data. Finally - finally, it worked! It marked attendance as expected for all 173 staff. Guess the execution time!!
Reduced by 144x
Yeah, the rewrite reduced the ClFn’s execution time by 144x: from ~8 mins (467,078 ms) to just 3235 ms!
How scalable now?
The new code marked attendance of 173 staff in just ~3 seconds. Assuming the same speed, it can mark attendance for roughly 31,000 staff in a single execution (173 staff per ~3 s, over the 540 seconds that is the maximum timeout of a Firebase ClFn)!
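The numbers above can be checked with a few lines of arithmetic. The sketch assumes that throughput stays linear as the staff count grows, which the post takes for granted but which has not been measured.

```javascript
const staffCount = 173;
const oldRunMs = 467078;         // 7 mins 47 secs 78 ms
const newRunMs = 3235;           // run time after the second rewrite
const maxTimeoutMs = 540 * 1000; // maximum ClFn timeout

// Speed-up of the rewrite: about 144x.
const speedup = oldRunMs / newRunMs;

// Capacity per execution with the rounded figure of ~3 s per 173 staff: 31140.
const capacityRounded = (staffCount * 540) / 3;

// Capacity per execution with the exact 3235 ms: 28877 (rounded down).
const capacityExact = Math.floor((staffCount * maxTimeoutMs) / newRunMs);

console.log(Math.floor(speedup), capacityRounded, capacityExact);
```

So the ~31,000 figure comes from rounding the run time down to 3 seconds; with the measured 3235 ms the same estimate is closer to 29,000.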
Now one might ask - what happens if the number of staff grows even more? That will require one more rewrite: the ClFn would have to be executed multiple times, with the staff divided among the executions. For example, 3 executions can mark attendance of 93,000 staff!
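Dividing the staff among executions could start from a helper like the one below. This is a hypothetical sketch: `partition` is an illustrative name, and how each slice would be handed to a separate ClFn execution is left open.

```javascript
// Splits `items` into at most `parts` slices of nearly equal size.
function partition(items, parts) {
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError('parts must be a positive integer');
  }
  const size = Math.ceil(items.length / parts);
  const slices = [];
  for (let i = 0; i < items.length; i += size) {
    slices.push(items.slice(i, i + size));
  }
  return slices;
}

// Example: 7 staff IDs divided among 3 executions.
const slices = partition(['s1', 's2', 's3', 's4', 's5', 's6', 's7'], 3);
console.log(slices.map((s) => s.length)); // slice sizes: 3, 3, 1
```

Each slice could then be processed with the same parallel logic as a single execution uses today.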
So, no more inconsistency?
No, I can’t promise that. Consider this: the cloud function first creates the currAttendance docs for all orgs and then marks attendance individually for each staff member. If any error occurs along the way, for example attendance gets marked in monthAttendance but the increment of noOfPresent in cycle fails, the data becomes inconsistent. Yes, even with the current code. One might suggest using a Transaction or WriteBatch. Good solution, but it comes with its own limitations: at the time of writing, Firebase supports a maximum of 500 writes per transaction / WriteBatch. Hence, I got rid of all the previous WriteBatch usage in the code to avoid scalability issues in future.
The code could, however, be rewritten to use a separate WriteBatch for each staff member. But then again, concurrent additions to currAttendance would cause inconsistency problems. Pff…
What now? What if the data goes inconsistent? Well, there are now logs for each error that takes place. So the ClFn’s health has to be monitored regularly for errors, which then have to be corrected manually!
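One possible way to get a log entry per failed staff member is Promise.allSettled(), which waits for every promise and reports each outcome separately. This is a swapped-in alternative for illustration, not what the ClFn described here does (it uses Promise.all()); `markAll` and `markOne` are hypothetical names.

```javascript
// Runs `markOne` for every staff ID and returns the ones that failed, with reasons.
// `markOne` is expected to be an async function (it must return a promise).
async function markAll(staffIds, markOne) {
  const results = await Promise.allSettled(staffIds.map((id) => markOne(id)));
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === 'rejected') {
      const reason = result.reason instanceof Error
        ? result.reason.message
        : String(result.reason);
      failed.push({ staffId: staffIds[i], reason });
    }
  });
  return failed;
}

(async () => {
  // Hypothetical writer that fails for one staff member only.
  const markOne = async (id) => {
    if (id === 's2') throw new Error('cycle update failed');
  };
  const failed = await markAll(['s1', 's2', 's3'], markOne);
  failed.forEach((f) => console.error(`attendance failed for ${f.staffId}: ${f.reason}`));
})();
```

A list like `failed` would also give a manual correction script its exact starting point.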
And it’s resolved!
However, for now the problem is finally resolved with parallel execution & Promise.all(). Here are the lessons I learnt from this entire incident:
- If it works now, it doesn’t mean it will keep working!
- Scalability comes with its own challenges
- There is always a way out
- Never give up
- Writing code in Node.JS is hard but not impossible
- Challenges of working with Firebase Cloud Functions:
  - Only Node.JS!
  - There are emulators, but you might not know the bug until it’s executed with the actual data
That’s it! This was how I tackled this - one of the biggest coding challenges in my life.
Thank you for reading.