Using AI to save time and money, hopefully

[Image: Midjourney v6.1, “Exploding AI Bubble”]

Using AI to save time and money by replacing old software is not new. Still, DOGE and the US Social Security Administration are about to run the largest experiment in this area on the planet. It may work, though the odds are not in their favour.

DOGE is starting to assemble a team to migrate the Social Security Administration’s computer systems entirely off one of its oldest programming languages in a matter of months, potentially putting the system’s integrity and the benefits of 65 million Americans at risk[1].

Their target is a massive computer system that provides the backbone for SSA operations, from assessing claims to paying benefits. The system is ancient, with new subsystems bolted onto it over decades. It is written in COBOL, a programming language for mainframe computers dating back to the 1960s, and a replacement is long overdue.

The current system is costly to run, hard to maintain, difficult to understand, and prone to errors. Replacing it has been on the SSA’s priority list for years and has always been seen as a massive undertaking: several years of code development and migration effort, not several months.

DOGE’s unconfirmed approach is to use AI to understand how the SSA software tools work and then use more AI to replace that software with a better version. The resulting system will use even more AI to assess claims, speed up SSA processes, enable quicker development of new subsystems, and ensure correct payments to claimants. All of this would decrease the risk of fraud or error and the need for staff to manage the process.

A simple example, maybe based on real life

Using AI for these tasks is not a new idea. We deployed similar concepts in the 2000s, replacing human processes and archaic code with automated inference machines. Those efforts did not work, yet the lessons still apply today. By coincidence, Anthropic has just released research on this topic, showing that AI, after over 25 years of research, can still be a problematic tool to understand or use.

So what can go wrong, based on having tried this before? Let’s set up a simple example that is not real and is used purely for illustrative purposes. Yet all of this story, in a related world, is true.

A bank branch transfers the same sum daily from a single account to 10 others. That branch manages all the accounts, so they share the same identifying six-digit sort code, and all accounts use eight-digit account numbers. The teller enters the sort code once, then the paying account number, then the ten receiving account numbers, and inputs the sum for each account: let’s assume $100.

An old computer running old code processes this action over several hours. The computer looks up the accounts and transfers $100 from the paying account to those accounts. The payment must be made that day, yet there is no immediate rush to speed up the process as long as it finishes by 5 p.m. If there are any errors, the bank teller gets an alert, examines the account details, and manually makes any changes.
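
To make the setup concrete, here is a minimal sketch of that daily batch in Python. The account numbers, the `transfer` stand-in, and the alerting behaviour are all invented for illustration, not taken from any real banking system.

```python
# Toy version of the daily batch: one paying account, ten receiving
# accounts, $100 each. All details below are made up for illustration.
SORT_CODE = "123456"                 # entered once; every account shares it
PAYING_ACCOUNT = "00000001"
RECEIVING_ACCOUNTS = [f"1000000{i}" for i in range(10)]   # ten 8-digit accounts
AMOUNT = 100                         # dollars

def transfer(sort_code, source, destination, amount):
    """Stand-in for the slow mainframe call; returns True on success."""
    # The real process ran for hours and could fail mid-transaction.
    return True

def run_daily_batch():
    failures = []
    for account in RECEIVING_ACCOUNTS:
        if not transfer(SORT_CODE, PAYING_ACCOUNT, account, AMOUNT):
            failures.append(account)   # the teller gets an alert and fixes it by hand
    return failures

failed = run_daily_batch()
print(f"{len(RECEIVING_ACCOUNTS) - len(failed)} payments made, {len(failed)} left to fix by hand")
```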

Everything works perfectly fine. The teller sets up the payment each morning, checks that the payments are progressing at lunchtime, and fixes any errors before they depart for the evening. The result is 11 happy customers – the person paying and the 10 people receiving $100 daily. Plus, the teller feels their job has helped 11 people by merely entering a few numbers and checking the odd error.

In setting up the system, the coders made a few simple decisions to make it effective and efficient. First, these transactions are still processed centrally on a mainframe computer to ensure that records are kept and to reduce the risk of fraud in a single branch. So, communication between the branch system and the central mainframe must be in a language the mainframe understands.

Then, before international enterprise-scale organisations deployed global software tools, a handful of people within the firm developed these tools. A small team would write the code and discuss new features. Most of this was written before Agile existed as a concept and when documenting what a routine did was quite rare. The people who built the code also ran it, and often sat next to the machine running it, so if anything went wrong, they knew how to fix it.

Co-location meant that fixes could be done quickly, often over the phone, while a developer worked through the issue. It also gave the developers an added bonus: a job for life. The programmer had to be kept happy if you wanted the program to run. Most organisations followed this model, but it created lines and lines of code that only a very few people understood.

When computers began popping up in other locations, these same developers would copy part of their code onto those local machines and let it run. Doing so allowed them to distribute code they knew would work with their central system with little effort.

Testing was often done in-house and by hand, so developers would take time off writing new code or supporting current code to run tests. In the 1990s, big software companies tested their code properly, but in-house teams would simply check that it produced the answer they expected and then distribute it. After all, if it went wrong, they were just a phone call away to fix it.

Errors, errors, errors

Back to the bank teller, happily entering 10 account numbers in the morning and checking progress between helping other customers in the bank face to face. It was still a time when people went into banks in real life for most of their banking needs.

The tellers would notice that errors were rare but often similar. For instance, there is a very, very small probability (1 in 100,000,000) that two account numbers could be the same, and it is even less likely that the sort codes would also match (equivalent to matching two specific human hairs in a line of hairs stretching 7 million kilometres). A far more likely failure was the connection dropping between the branch and the central mainframe during a transaction.
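
For the curious, here is the back-of-the-envelope arithmetic behind those odds, assuming (purely for illustration) that account numbers and sort codes behave like uniform random digits and that a human hair is about 70 micrometres wide.

```python
# Odds of two accounts sharing an 8-digit account number: 1 in 10^8.
account_match = 1 / 10**8
# Odds of the 6-digit sort codes also matching: a further 1 in 10^6.
full_match = account_match / 10**6           # 1 in 10^14 overall

# The hair analogy: at ~70 micrometres per hair, 10^14 hairs laid side
# by side stretch roughly 7 million kilometres.
hair_width_m = 70e-6
line_length_km = (1 / full_match) * hair_width_m / 1000
print(f"1 in {1 / full_match:.0e}; line of hairs about {line_length_km:,.0f} km")
```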

The highest technical risk was the branch computer crashing, whether from overheating, a power cut, or just a bug in the software. To contain these errors, tellers would enter numbers in small batches and let them run. Then, if the connection failed or the computer crashed, they only had to re-enter 10 accounts rather than hundreds at a time.

Of course, the most significant errors were caused by humans: selecting the wrong account numbers for payment, entering the wrong account numbers into the machine, not checking that payments were completed, or writing errors into the code and not testing it.  

Another quirk of old systems involved security. Installing an application on a work device is difficult today. Back then, most systems were open if you knew what you were doing, and code was relatively easy to hack.

Consequently, local users added local code to systems. A smart bank teller could access their computer and add a new, local routine. This was much quicker than calling the head office to ask for a new feature, which often resulted in being put through to the coding team, who were too busy doing their own thing to add user-requested features.

Tellers could add features that checked account numbers. If they entered the same sort code every day, they could add a quick routine that filled it in automatically. They could store the accounts to be paid daily in a separate list and look that up rather than manually adding each account. Users could set up message notifications. In one case I know of, people would use the equivalent of an account entry box to communicate with other tellers, a primitive routing tool built on sort codes. Today, this seems very wrong. Back in the 1990s, hacking was just part of the job.

The central office was too busy to check if this was happening, and where it produced errors, these would be locally contained and managed. Plus, the machines were relatively slow, and multiple people were involved in the actual process, so total errors were few.

The AI Bubble of the 1990s

Now, let’s introduce the AI Bubble of the late 1990s. Inference machines were all the rage, and inferring information within data sets offered huge gains. One example from our fictitious story would be to use inference to reduce the number of digits processed, which appeared to be a simple task. In the sequence 12345, the next probable number would be 6. If communication was lost and the sequence was 12?45, then the inference machine would assume the missing number is 3, especially if it had previously seen 12345 regularly.
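
A minimal sketch of that kind of inference, assuming nothing more sophisticated than remembering which complete sequences have been seen before (the real 1990s routines would have differed in detail):

```python
from collections import Counter

class DigitInferrer:
    """Toy inference machine: guess a missing digit from sequences seen before."""
    def __init__(self):
        self.seen = Counter()

    def observe(self, sequence: str):
        self.seen[sequence] += 1

    def infer(self, partial: str) -> str:
        """Fill '?' with the digit from the most frequently seen matching sequence."""
        gap = partial.index("?")
        candidates = Counter()
        for sequence, count in self.seen.items():
            if len(sequence) == len(partial) and all(
                s == p for i, (s, p) in enumerate(zip(sequence, partial)) if i != gap
            ):
                candidates[sequence[gap]] += count
        best = candidates.most_common(1)
        return partial.replace("?", best[0][0]) if best else partial

machine = DigitInferrer()
for _ in range(100):
    machine.observe("12345")       # the machine has seen 12345 regularly
print(machine.infer("12?45"))      # -> "12345": the missing digit is assumed to be 3
```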

Computers ran slowly, especially with extensive lists of numbers. If you could halve the number of digits used by processing only the last four digits of an account number, and assume that the odds of two numbers matching were still 1 in 50 million, then the machine could process those accounts almost twice as fast. If you could infer the missing digits when communications dropped, then you didn’t necessarily need to reconnect and error-check; you could proceed as if the connection had remained in place. All these small inferences would save time, and when compute power and storage were costly (read: before the cloud), every digit counted.
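
A sketch of that truncation shortcut, again purely illustrative: keying accounts on their last four digits saves work, but with only 10,000 possible four-digit keys, different accounts can quietly end up sharing one.

```python
# Illustrative only: process the last four digits of each account to halve the work.
accounts = ["00012345", "55512345", "00067890"]   # made-up 8-digit account numbers

index = {}
for account in accounts:
    key = account[-4:]                  # the speed-up: four digits instead of eight
    index.setdefault(key, []).append(account)

collisions = {key: accs for key, accs in index.items() if len(accs) > 1}
print(collisions)    # {'2345': ['00012345', '55512345']} -- two accounts, one key
```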

Plus, bank tellers still checked the payments each night, and, as a last resort, customers would come into the branch and ask about their missing $100 payment.

Initially, like DOGE, we looked at how inference machines could replace the code. Could we port it from an ancient coding language and make it more effective and efficient? Could it be a Rosetta Stone for ancient languages?

Alas, with colossal code bases of jumbled routines and sub-routines, without any explanation of what it all did, and with the original coding teams either retired or, more likely, unwilling to assist in removing their job-for-life employment managing the code, this approach was doomed.

What about improving the process and using inference machines at key points? Everyone knew that local branches were running their own code, often to improve their specific work, and helping with that could be beneficial. Looking at what they were doing, several patterns emerged: regular activities with similar information that could be repeated and replicated. AI thrives on repeatable patterns, after all!

At the same time, computer hardware was massively increasing in performance and significantly dropping in price. It became possible to start running inference routines locally and, with the emergence of a more stable internet, to collect local data centrally and more reliably. Rapidly, it became feasible to reduce the work done by local tellers, run processes centrally at hugely increased rates, and collate error detection within a smaller, centrally managed team. This centralisation would allow more tellers to work face-to-face with customers and reduce the security risks of local branches running unauthorised code.

For comparison, rather than running 100 payments daily, these changes enabled 1,000 payments every second. That speed increase also raised our fictional model’s daily error cost from £10,000 to over £28 million in just 8 hours.
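
The back-of-the-envelope arithmetic behind that jump, using the figures in this story:

```python
old_daily_payments = 100                      # roughly 100 payments spread over a day
new_daily_payments = 1_000 * 60 * 60 * 8      # 1,000 per second over an 8-hour day

print(f"{new_daily_payments:,}")                                  # 28,800,000 payments a day
print(f"{new_daily_payments // old_daily_payments:,}x the old daily volume")   # 288,000x
```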

Of course, it went wrong

First, errors are relative to the number of processes run and the completion time. 100 payments over 8 hours, with a human checking at the start, middle, and end of that activity, would reveal few errors, and they could be quickly fixed by hand. Even at a slow rate, 100,000 processes per day would naturally reveal more errors. If only 1 in 100,000 processes created a mistake, fixing errors immediately became a daily activity rather than something seen once every three years.
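
A quick sanity check of those frequencies, assuming a flat 1-in-100,000 chance of a mistake per process:

```python
error_rate = 1 / 100_000       # assumed flat chance of a mistake per process

# Old world: 100 processes a day -> one error roughly every 1,000 days (about 3 years).
days_between_errors_old = 1 / (100 * error_rate)
print(days_between_errors_old)     # 1000.0

# New world: 100,000 processes a day -> roughly one error every day.
errors_per_day_new = 100_000 * error_rate
print(errors_per_day_new)          # 1.0
```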

Doing more things simultaneously also allowed new things to be attempted. Payments could be scheduled, multiple account transfers could be linked, and transaction recording could be simplified. Suddenly, that legacy code that had been super-accelerated really didn’t look that great. Unstructured, poorly documented, incrementally built code had a lot of errors within it just waiting to break free. With many humans in the loop and a slow rate of discovery, these could be fixed by calling up the central developer team and having a chat. With thousands of errors suddenly being unleashed at once, the developer team became swamped and unable to respond. Plus, morale plummeted as all the bad coding practices they had previously ignored came back to haunt the team.

More importantly, the AI started to show weird errors, not all bad. AI unearthed oddities that were previously missed but then appeared obvious when seen. Odd quirks in the system became apparent. For instance, some account numbers would appear again and again. Often, this would be genuine fraud or crime, with money being syphoned off without permission.

Other times, the inference routines would get stubborn or simply grumpy. Rather than predicting 6 for 12345, they would produce 1 or 9. Why? Genuinely, people found it hard to work out, an issue that continues with AI today[2]. Data analysis would point to the Newcomb–Benford Law for Anomalous Numbers[3] as a possible cause for random numbers not being random, but the inference would still sometimes act oddly[4].
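
For reference, the Newcomb–Benford expectation the analysts would have been checking against says that the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d), so 1s should dominate. Here is a minimal sketch of that check, with made-up payment amounts:

```python
import math
from collections import Counter

def benford_expected(digit: int) -> float:
    """Expected share of leading digit d under the Newcomb-Benford law."""
    return math.log10(1 + 1 / digit)

def leading_digit_shares(values):
    """Observed share of each leading digit in a list of positive numbers."""
    counts = Counter(str(v)[0] for v in values)
    total = sum(counts.values())
    return {d: counts.get(str(d), 0) / total for d in range(1, 10)}

# Made-up payment amounts, purely for illustration.
payments = [100, 120, 95, 310, 110, 870, 105, 130, 2400, 101, 115, 990]

observed = leading_digit_shares(payments)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```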

Overall, the inference systems failed. They accelerated too fast, did not truly understand the base code, had no way of addressing unknown errors within the system, removed the humans who made the systems work (the tellers), and annoyed the humans tasked with fixing them (the developers). Most importantly, customers suddenly became grumpy when machines failed to complete simple transactions. All because a clever AI was being used well beyond its limits.

Top Three Recommendations for turning AI loose

What would be the three top recommendations before DOGE turns AI loose on SSA?

  1. Spend more time on the base code to understand what it is doing. AI can help here, but it still cannot replace human insight. The system is probably not doing what it says it is doing, AI is not doing exactly what it says it is doing, and AI is unable to determine whether that is an intended or unintended consequence. In short, AI doesn’t know why it made a decision.

  2. Test the current system and use AI to run those tests. Test with extreme corner cases and larger data sets than expected (see the sketch after this list). If you cannot test on massively scaled data, run on minimal data until you can test it at scale. Sixty-five million humans is not a test data set. It’s their lives.

  3. Include the people who built the machines and operate the processes. They know how it all works, where the issues really exist, and the workarounds that are employed to get the job done. They are the actual hackers of the machine and need to be involved rather than excluded.
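
As flagged in the second recommendation, here is a minimal sketch of what corner-case testing might look like. The `transfer` function is a hypothetical stand-in for the system under test, and conservation of money is just one example of the invariants such tests should pin down.

```python
import random

def transfer(balances, source, destination, amount):
    """Hypothetical stand-in for the system under test."""
    if amount <= 0 or balances.get(source, 0) < amount:
        return False
    balances[source] -= amount
    balances[destination] = balances.get(destination, 0) + amount
    return True

def test_money_is_conserved(trials=10_000):
    random.seed(0)
    for _ in range(trials):
        balances = {"A": random.choice([0, 1, 99, 100, 10**9]), "B": 0}
        total_before = sum(balances.values())
        # Corner cases on purpose: negative, zero, tiny, and oversized amounts.
        transfer(balances, "A", "B", random.choice([-1, 0, 1, 100, 10**9 + 1]))
        assert sum(balances.values()) == total_before, balances

test_money_is_conserved()
print("money conserved across all corner cases")
```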

Anyone can break things, but breaking things and making them better needs experience of the actual thing being improved. Years after my experiences of breaking things and not always making them better, I heard a German word, “die Verschlimmbesserung”. It is my favourite from a nation famous for using single words to express an entire sentence. It describes an action that is supposed to make things better but ends up making them just a little bit worse.

Misused AI can be the epitome of this sentiment. It is well-intentioned but not always understood, promising huge steps towards improving things but ending up making them worse. People are trusting AI to improve their lives, but that trust is delicate. Getting an AI deployment wrong at the scale proposed for the SSA might not just impact the 65 million humans who desperately need the payments that the SSA provides; it may tar the whole deployment of AI and create another burst AI bubble.

[1] DOGE Plans to Rebuild SSA Code Base in Months, Risking Benefits and System Collapse

https://www.wired.com/story/doge-rebuild-social-security-administration-cobol-benefits/

[2] Anthropic recently published a paper on this phenomenon: ‘Circuit Tracing: Revealing Computational Graphs in Language Models’

[3] Benford’s Law: Explanation & Examples

[4] Better explained in ‘AI Biology’ Research: Anthropic Explores How Claude ‘Thinks’

[5] DOGE’s new plan to overhaul the Social Security Administration is doomed to fail
