High Volume Data Extraction
The Problem: Way Too Much Information
A major brokerage firm allows clients to trade stocks and mutual funds online. In order to provide up-to-date information to these clients, the firm relies on an organization called Morningstar, which rates and values every stock and delivers a report to firms every month. It is vitally important in the industry that the information be current by the 3rd of every month, in order to keep traders informed on performance. Firms such as our client offer warranties to their clients, meaning that they risk serious losses if they fail to provide up-to-date performance data.
But taking the information that Morningstar generates and providing it to clients is not a simple proposition. The monthly files report in detail on every stock on the market, which means enormous file sizes of 16 to 32 gigabytes, and they are uploaded to the Morningstar website for brokers to download. Of course, a download of that size takes a lot of time, and if anything interrupts it, it has to be started all over again. Our client would take three to four days to download, transform and load the whole file, and an employee had to monitor the download the whole time to make sure that no time was lost if an error occurred. Even with these precautions, there were times when the firm missed its deadline, meaning it lost credibility and had to pay out on customer warranties.
The Analysis: Turn Down The Volume
In this case, there was an obvious solution that no one seemed to have thought of before. If the size of the file — the sheer volume of information — was causing so many problems, why not just break it down into a series of smaller files?
That was the easier part of the analysis. From there, it became an issue of how to effectively manage those smaller files to make the download process as efficient and reliable as possible.
The Solution: An Incredible Chain Reaction
Once we’d put in place a program to divide the one huge file into sixteen smaller files, we were able to organize the download process in a way that seemed like common sense, even though industry giants had been struggling with the cumbersome downloads for years. Even downloading many small files one after another takes less time, and is less vulnerable to failure, than downloading one huge file, because an interruption costs only the part in progress rather than the whole transfer. But the process could be further improved by setting up the files to download in parallel, so employees wouldn’t have to wait for one part to download before starting the next part.
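To make the idea concrete, here is a minimal sketch of dividing a large file into sixteen parts and fetching them in parallel. It is an illustration under stated assumptions, not the program we built for the client: the names split_ranges, download_parallel and fetch_part are hypothetical, the split is assumed to be by byte range, and an in-memory byte string stands in for the real monthly file.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, parts=16):
    """Return `parts` contiguous (start, end) byte ranges, with end exclusive."""
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for i in range(parts):
        # Spread any remainder over the first `extra` parts.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def download_parallel(ranges, fetch_part, max_workers=4):
    """Fetch every range concurrently; results come back in range order.

    `fetch_part(start, end)` is a hypothetical callable that returns the
    bytes for one part. In practice it would wrap a network request.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: fetch_part(*r), ranges))


if __name__ == "__main__":
    source = bytes(range(256)) * 100  # stand-in for the monthly file
    parts = download_parallel(split_ranges(len(source)),
                              lambda start, end: source[start:end])
    assert b"".join(parts) == source  # the parts reassemble into the original
```

Because the fetching function is passed in, the same structure works whether the parts are separate files or ranges of one file.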
We automated the download process, creating a chain reaction in which finishing the download set off the program to transform the data into a usable form, and finishing that transformation set off the program to load the data. Using an event-driven mechanism meant that no one had to wait around to push a button to keep the process going.
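The chain reaction can be sketched with a very small event mechanism, in which each stage announces that it has finished and the next stage is registered to react. This is only an assumed shape for illustration: the EventBus class, the build_pipeline helper and the event names "downloaded" and "transformed" are hypothetical, and the transform and load steps are placeholders.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process event dispatcher (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        """Register `handler` to run whenever `event` is emitted."""
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        """Call every handler registered for `event` with `payload`."""
        for handler in self._handlers[event]:
            handler(payload)


def build_pipeline(bus, transform, load):
    """Wire the stages: a finished download triggers transform, then load."""
    bus.on("downloaded", lambda raw: bus.emit("transformed", transform(raw)))
    bus.on("transformed", load)


if __name__ == "__main__":
    bus = EventBus()
    loaded = []
    build_pipeline(bus, transform=str.upper, load=loaded.append)
    bus.emit("downloaded", "abc,1.5")  # one event drives the whole chain
    print(loaded)  # ['ABC,1.5']
```

Nothing in the chain polls or waits for a person. Emitting the first event is enough to carry the data through to the load step.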
Finally, we equipped the process with a number of error-handling mechanisms, so that an employee no longer had to “babysit” the download to make sure nothing went wrong. We’d created a download system that didn’t need to be started over from the beginning if anything disrupted it, meaning no more costly missed deadlines. But even more importantly, we’d reduced the time needed from three or four days to just three hours, saving huge amounts of time, effort, money and risk.
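One way such error handling might look is sketched below: each part is retried a few times on failure, and parts that have already arrived are skipped when the process resumes. The function names, the retry count and the use of OSError as the failure signal are assumptions made for the example, not a description of the mechanisms actually deployed.

```python
import time


def fetch_with_retry(fetch, part_id, attempts=3, delay=0.0):
    """Call `fetch(part_id)`, retrying on OSError up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(part_id)
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # back off a little more each time
    raise RuntimeError(
        f"part {part_id} failed after {attempts} attempts") from last_error


def resume_download(part_ids, fetch, done):
    """Fetch only the parts missing from `done`, a dict of part_id -> data."""
    for part_id in part_ids:
        if part_id not in done:
            done[part_id] = fetch_with_retry(fetch, part_id)
    return done
```

With this shape, a dropped connection costs one retry of one part, and a restart picks up from whatever is already recorded in done rather than from the beginning.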
