GenAI: Separating the Signal from the Noise
Introduction
This post introduces a small but interesting problem whose solution includes—perhaps even features—the integrated use of a large language model (LLM). It has three objectives:
To illustrate the maturity of the LLM (Generative AI) space and the relative simplicity of integrating artificial intelligence components into a common workflow.
To highlight the factors that need to be weighed in selecting what model to introduce and how to introduce it into a given problem space.
To present and defend the lesson that traditional issues in system design—from requirements gathering to interfaces, integrations, database management, testing, and deployment—continue to represent the larger share of influence on a project's success, and that for any interesting project, the "AI" component is only one part of a larger whole.
We're going to look at what might be a reasonable weekend project for the average systems engineer. You will not need a technical background to read and understand the problem or the nature of the steps to solve it, although anyone wanting to replicate the process or results would want familiarity with POSIX systems administration, database definition and management, setting up and configuring open-source large language models, and some kind of high-level language (in my case, Python) with bindings to a GenAI framework.
Overview of Email Management Challenges
The problem of "signal and noise" is a common one in the current "information age." On a daily basis, all of us are bombarded with more and more information, most of which has little to no value to us, but which, from the sender's point of view, is worth distributing in unfathomable volume: the per-unit cost of doing so is low, and the potential economic value of reaching even a small number of people willing to part with time, money, or influence is high.
We call the broad category of this mostly advertising material "spam," usually in contrast to "ham." Spam is, by definition, low-quality, high-volume information, usually from one sender to a large number of recipients. Ham, by contrast, is high-quality, low-volume information, usually from one person or organization to another.
Often we formally define "spam" as "Unsolicited Bulk Commercial E-Mail" (UBCE), which is a nice way of saying that it's a lot of junk about buying stuff we don't want.
A broad characteristic of "spam" is that it also tends to be stylistically different from "ham." This means that a stochastic filter (like a Bayesian filter) trained on the stylistic differences between e-mail that is important to us and e-mail that is not tends to be pretty good at distinguishing between what we care to be informed about (a close cousin's wedding, for example) and what we don't (these days, commonly a fictitious subscription to a service like the Geek Squad or Norton LifeLock).
This much is easy.
Defining the Problem
The problem gets to be more difficult as it becomes harder to distinguish between "spam" and "ham," or, more generally, between signal and noise.
For many people, the Internet is a giant marketplace, and many of us find nooks and crannies in it to buy all sorts of things. Often, in those transactions, we'll sign up for "infrequent" updates from the companies we do business with to learn about new products, sales or promotions, or other company news that might be relevant as we make purchase decisions in the future. We might also sign up for breaking news updates, columns from writers on platforms like Substack or Medium, local community events, alumni associations, Meetups, clubs, entertainment venues, or newsletters from artists prone to dropping viral albums in the middle of the night.
As of the date I am writing this blog, there are more than 515 publishers, companies, institutions or groups who regularly send me e-mail of this type. I receive more than 145 individual pieces of e-mail every day across all of these senders. Some mail infrequently; one menswear company, by contrast, sometimes sends me three separate e-mails in a single day. (Why they think I have clothing purchases on my mind more than a thousand times a year is a mystery to me.)
The volume of mail I have chosen to receive is so large that it dwarfs actual spam—low-quality, unsolicited e-mail—by more than 20 to 1.
I want to continue to receive it, but this volume of commercial messaging causes several problems:
It is difficult to separate personal messages from commercial messages, and easy to lose track of e-mail that needs a reply, especially if I fall behind on checking e-mail for a couple of days while traveling.
It is time-consuming to go through and read, file, or delete this volume of e-mail on a daily basis. Even at just 5 seconds per e-mail, this amounts to more than 12 minutes a day reading sales pitch after sales pitch, and that doesn't include reading breaking news summaries or newsletters with content that might or might not be interesting or relevant.
The volume creates a data management/data retention problem. Some things I might be comfortable deleting immediately. Some might be relevant for a few days, others for maybe a couple of months. Some information, of course, is important to retain for purchase records or long-term reference. No one has time to think about data retention for commercial e-mail over and over again: 15 minutes a day would quickly grow into hours a day. That just doesn't work.
These problems, of course, don't only apply to a single person's e-mail volume. They are relevant for many kinds of data management and data governance processes in an enterprise of any size. What do you do, for example, with all the logs from your support chat transcripts? All your customer correspondence? All the notes from meetings where business decisions were made that have only a specific temporal relevance? Do you throw it all into your "data swamp" until it's time to drain it and start over?
Whether individually or as enterprises, we face the same dilemma every day: there is a ton of information, it takes too long to go through, it is unclear what is important when and for how long, and we are never quite sure when we are done with it or whether we are going to keep it, for no real benefit, forever.
Requirements for Email Management
With the proliferation of products now "integrating" AI, there are plenty of e-mail clients to choose from that will do their best to summarize and categorize e-mail by topic, sender, or content. Most of these will, for example, tell you whether a given piece of e-mail is a newsletter or a commercial e-mail. Some will have a kind of "white list" operation that holds e-mails that seem suspicious or new or don't fit the usual pattern of e-mails you read, and potentially blocks their senders.
As with many products, though, these features are great for the vendors' own promotions and company announcements, but they only provide the illusion of solving the problem. For the most part, these tools are simple categorizers or summarizers slapped on top of an existing interface or workflow, providing a little additional information but not materially affecting the way you manage incoming content or how much time you spend doing it.
I didn't keep track, but I'm pretty sure that if I had, I would find that I spent about ten times as much time understanding what I was really trying to accomplish as I did building the system that accomplished it, and I think that was the right ratio.
Here is a rough reconstruction of the requirements that I set out to meet:
More than anything else, I wanted to stop being distracted by the constant stream of inbound e-mail. The ideal system would allow through personal and urgent e-mail, accompanied by the usual notifications. Everything else could wait, ideally, for me to look at it all at once.
Apart from meeting that basic requirement, the ideal system would minimize the amount of time that I needed to interact with the total set of e-mail on a daily basis. I set an objective target of no more than 3 minutes per day.
The ideal system would let me filter my e-mail for action as one step, separate from any reading, research, or follow-up, and it would integrate seamlessly with the task management system I use for all other reading, research, or follow-up outside of e-mail. In other words, if there were three articles I wanted to read or three things I wanted to purchase, my objective was to flag those things within those 3 minutes and then do the follow-up separately.
Whatever system I set up would have to be cost-resilient: if I changed my mind about how I wanted to save, summarize, or process things, I should be able to re-evaluate those decisions without a large financial trade-off. That favors a system with a large sunk cost and low ongoing costs over one that charges for every iteration. For me, this is a fundamental principle of innovation: if there is a high cost for iteration, there is a high bar, and a strong disincentive, to iterate.
The System Design
These four basic requirements naturally led to a simple system:
I would maintain a local mirror of my cloud-based e-mail, and specifically an online and near-line split between active and old e-mail, all of which would be reasonably local to my home server, regardless of what cloud provider I happened to use. I specifically decided to mirror using offlineimap. A cron job maintains the mirror at a reasonably frequent refresh rate; a separate Python script purges the mirror (and, by extension, the cloud e-mail) of e-mail that is three days old, saving a copy in a Maildir locally and backing that Maildir up to a second cloud provider (using BorgBackup in a GFS configuration) daily. The net effect is that the amount of e-mail hosted by my cloud provider is relatively small, and anytime I want to run repeat analytics on the entire body of my e-mail, it is all local and highly performant.
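To make the purge step concrete, here is a minimal sketch of what that daily script might look like, assuming the mirror and archive are plain Maildirs at hypothetical paths; offlineimap propagates the removals back to the cloud, and the BorgBackup run is handled separately:

```python
# purge_mail.py -- illustrative sketch only; the paths and the three-day cutoff are assumptions.
import mailbox
import time
from email.utils import parsedate_to_datetime

MIRROR = "/home/me/Mail/mirror/INBOX"   # hypothetical path to the offlineimap mirror
ARCHIVE = "/home/me/Mail/archive"       # hypothetical path to the local archive Maildir
CUTOFF_SECONDS = 3 * 24 * 3600          # three days

def purge():
    mirror = mailbox.Maildir(MIRROR, create=False)
    archive = mailbox.Maildir(ARCHIVE, create=True)
    now = time.time()
    for key, msg in list(mirror.items()):
        try:
            sent = parsedate_to_datetime(msg["Date"]).timestamp()
        except (TypeError, ValueError):
            continue  # leave anything with an unparseable date alone
        if now - sent > CUTOFF_SECONDS:
            archive.add(msg)    # keep a local copy...
            mirror.remove(key)  # ...and drop it from the mirror (and, on the next sync, the cloud)

if __name__ == "__main__":
    purge()
```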
I would maintain a database of all my e-mail as far back as I have it. The database would include the metadata from the email itself (sender, subject, date, and so on), a reference to the original e-mail, and the AI-generated summary of that e-mail. I would keep this, again, local, but back up to a cloud provider frequently. (This being said, the backup of the database is not a requirement, as I can re-generate the database on demand at any time, even if it won't be exactly the same from one generation to the next. Still, because the system is always being tweaked in one way or another, it is useful to have a record of what the data looked like at any given time, should I introduce a bug at some point and need to examine its effects.) I would use this database for all operations, like generating reports or statistics, summaries, tasks, and so on. In addition to being backed up, I keep this database in a dedicated directory that is available for replication to any laptop or workstation I happen to be at, over Unison. Unlike my Python code, which I maintain under version control in several git repositories, I don't keep this database under version control because my needs don't justify the complexity of managing an LFS-enabled git repository for the binary data. The backups and Unison suffice for my purposes.
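For illustration, the database needn't be anything more elaborate than a single SQLite table along these lines; the table and column names here are my reconstruction for the sketch, not a spec:

```python
# init_db.py -- illustrative sketch; table and column names are assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id          INTEGER PRIMARY KEY,
    message_id  TEXT UNIQUE,   -- RFC 5322 Message-ID, used to detect duplicates
    sender      TEXT,
    subject     TEXT,
    sent_at     TEXT,          -- ISO-8601 date/time
    maildir_ref TEXT,          -- reference to the original message in the local Maildir
    category    TEXT,          -- high / low / suppress / junk (see the sender list below)
    summary     TEXT           -- AI-generated summary
);
CREATE INDEX IF NOT EXISTS idx_messages_sent_at ON messages (sent_at);
"""

def init_db(path="mail_summaries.db"):
    with sqlite3.connect(path) as conn:
        conn.executescript(SCHEMA)

if __name__ == "__main__":
    init_db()
```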
I generate the summaries by running a Python script as a cron job on my home server, against all locally-cached e-mail, and store all results in the above database. I use llama-cpp with mixtral-8x7B as the language model, and the llama-cpp Python bindings in the script. The script is smart about mail refiling and duplicates. There were a few small hurdles, probably the most annoying of which was managing multi-part content (HTML and text) and making sure that whichever version I selected produced the best summary. The longest part of the technical exercise, however, wasn't writing the code—it was the painful process of experimenting with a dozen different LLMs and prompt engineering for each to arrive at the best balance of performance, brevity, complexity, and results, and then refining the selected prompt further to favor important content over template material like legal disclaimers and unsubscribe information. It's hard to overstate how easy it is to write a wrapper around an API call to an LLM compared to the seemingly endless tedium of tracking experiment after experiment across dozens and sometimes hundreds of e-mails to arrive at the best-performing model and prompt combination overall.
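To give a sense of how thin the model wrapper really is, here is a stripped-down sketch of the summarization call using the llama-cpp Python bindings; the model path, prompt wording, and parameter values are illustrative assumptions, not the tuned combination I eventually settled on:

```python
# summarize.py -- illustrative sketch; model path, prompt, and parameters are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mixtral-8x7b-instruct.Q5_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=8192,        # enough context for most newsletters
    n_gpu_layers=-1,   # offload all layers (Metal on Apple Silicon)
    verbose=False,
)

PROMPT = (
    "Summarize the following e-mail in two or three sentences. "
    "Focus on the substantive content and ignore boilerplate such as "
    "legal disclaimers and unsubscribe instructions.\n\n{body}"
)

def summarize(body: str) -> str:
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT.format(body=body)}],
        max_tokens=200,
        temperature=0.2,  # keep summaries terse and repeatable
    )
    return result["choices"][0]["message"]["content"].strip()
```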
I've set my e-mail provider to deliver all inbound mail to a Staging folder; a script then decides what to move to my Inbox or to the Archive, Trash, or Junk. Everything that goes into either the Inbox or Archive is summarized. This creates a small delay in e-mail receipt, but under most circumstances this is perfectly acceptable. If I happen to be expecting something like a security code or other urgent e-mail, I'll see it show up in the Staging folder.
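A minimal sketch of that triage step, working against the Staging folder of the local mirror; the folder names, the category-to-folder mapping, and the lookup_category helper are all hypothetical (the sender list it relies on is described in the next step):

```python
# triage.py -- illustrative sketch; folder names and the category lookup are assumptions.
import mailbox
from email.utils import parseaddr

DESTINATION = {
    "high": "INBOX",        # surfaced immediately
    "low": "Archive",       # kept and summarized, reviewed in the daily pass
    "suppress": "Archive",  # kept, but not part of the daily review
    "junk": "Junk",         # never reviewed
}

def triage(mail_root="/home/me/Mail/mirror", lookup_category=lambda sender: "low"):
    staging = mailbox.Maildir(f"{mail_root}/Staging", create=False)
    for key, msg in list(staging.items()):
        sender = parseaddr(msg.get("From", ""))[1].lower()
        category = lookup_category(sender)  # hypothetical helper backed by the sender list
        dest = mailbox.Maildir(f"{mail_root}/{DESTINATION[category]}", create=True)
        dest.add(msg)
        staging.remove(key)
```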
I maintain a list of senders and where their correspondence falls across four categories of importance: high (e-mail that I always want to highlight), low (e-mail that's generic), suppress (e-mail that I want to keep but don't need to review until I care to search for it), and junk (e-mail that it gives me great pleasure never to see or review at all). I do not rely on the AI to do this. I say this knowing that purists will say that this is not really an AI system, but I also did not set out to build an AI system. I do not trust AI to categorize e-mail for me in this way because, in my opinion, my moods, my interests, and my circumstances are too fluid for me to believe that an AI is going to get this anywhere close to right: I am interested in GAP e-mails, for example, but only when they feature a collaboration with my favorite designer. I am interested in e-mails from my favorite ramen supplier, but only when I am running low on the product I already have. I am interested in the news when there is a global conflict that might affect people or interests I care about, but I am less interested in the daily run of national political drama. I am sure that if I put in the time, a sufficiently advanced AI could develop a set of temporal rules that would keep track of all of this for me perfectly. In the meantime, I am not a purist, and my list is pretty good.
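The list itself doesn't need to be anything fancier than a plain-text file of sender/category pairs. A possible shape of the lookup, with the file format and the default category as assumptions:

```python
# senders.py -- illustrative sketch; the file format and the "low" default are assumptions.
def load_categories(path="senders.tsv"):
    """Read lines of the form 'news@example.com<TAB>high' into a dict."""
    categories = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            sender, category = line.split("\t")
            categories[sender.lower()] = category
    return categories

def lookup_category(sender, categories):
    # Senders I haven't classified yet default to "low": kept and summarized, never highlighted.
    return categories.get(sender.lower(), "low")
```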
Every morning, I sync the database to whatever local machine I happen to be using, and query it for the prior day's e-mail, sorted by category and then alphabetically by sender. I retrieve sender, subject, send date/time, and the summary of the e-mail as stored when it was processed, and create a structured file in org-mode, with high and low priorities as Level 1 headings, each e-mail as a Level 2 heading, date/time as a folded property, and the e-mail summary as content. I have always been on the correct side of the great Emacs/vim wars and, as a result, I do most things in Emacs, from task management to coding, journaling, and even graphing. The task buffer opens to a folded state, and I am able to glance through all the senders and subjects as an outline, quickly peeking into the short summary to see if it is worth following up on. If it is, I've defined a quick key combination in elisp (C-c r) that takes the e-mail summary, reformats it as a to-do item, and moves it to my Task.org file scheduled for the current day. It takes a couple of minutes to go through the high-priority e-mails in this way and then maybe 30 seconds to briefly scan the low-priority ones. Other than occasionally adding a new sender to the category list, this is the only part of all of the listed steps that I perform every day. Altogether, using the system, I have reduced the amount of time I spend looking at or thinking about e-mail by about 90-95% per day.
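For completeness, a compact sketch of the morning report generation, querying the database from the earlier step and emitting an org-mode outline; the schema, paths, and column names are, as before, assumptions:

```python
# daily_report.py -- illustrative sketch; schema, paths, and column names are assumptions.
import os
import sqlite3
from datetime import date, timedelta

def write_report(db_path="mail_summaries.db", out_path="~/org/email-review.org"):
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    query = """
        SELECT sender, subject, sent_at, summary FROM messages
        WHERE date(sent_at) = ? AND category = ?
        ORDER BY sender
    """
    lines = []
    with sqlite3.connect(db_path) as conn:
        for category in ("high", "low"):
            lines.append(f"* {category.capitalize()} priority")  # Level 1 heading
            for sender, subject, sent_at, summary in conn.execute(query, (yesterday, category)):
                lines.append(f"** {sender}: {subject}")           # Level 2 heading per e-mail
                lines.append(":PROPERTIES:")
                lines.append(f":SENT: {sent_at}")                 # date/time as a folded property
                lines.append(":END:")
                lines.append(summary or "")
    with open(os.path.expanduser(out_path), "w") as fh:
        fh.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_report()
```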
A Note about Real and Imagined Requirements
I started this project shortly after OpenAI released ChatGPT 4. I knew that I had four possible paths for this project and for others that would come later:
Build the entire system as described, but perform the summarization using an API call to the ChatGPT service. (As I stated at the top, you'll note that of all the steps involved in setting up this system, interfacing with the AI "core" is really one of the smallest. Deciding early on to use a service would have simplified some of the comparison steps and possibly shortened the prompt generation cycle, but I stand by the assessment that the AI interface is still the smallest part of all of it, and that changing to a commercial API service would not have materially affected either the process or the outcome.)
Build the entire system as described, but using an on-demand cloud service such as those provided by HuggingFace or other entities. (This would have meant using an open-source LLM but running it on someone else's hardware rather than my own, at incremental rather than fixed costs.)
Deploy on hardware in my home office, using the current commercial standard: a beefy Linux-based server with high-end NVIDIA GPUs.
Deploy on hardware in my home office, using Apple Silicon.
I did some experiments with ChatGPT 4 and decided against that approach for purely economic reasons. The system gave great results, but every run, revision, and test cost some fraction of a cent, and over the scale of more than a hundred thousand e-mails, it wouldn't have taken long before I would have spent more on API costs than I would have spent on hardware over its predicted life. More than the actual numbers, however, was the plain psychology of it: I wasn't intending to build a commercial system whose use I could monetize over a large client base. I was—outside of solving the actual problem—interested in coding and tinkering and having some fun, and these goals were not consistent with a clock ticking, charging a little more every time I wanted to try something new at scale.
This concern also doomed the second option, running on cloud hardware by the minute, by an even larger margin. As companies that have moved from internally hosted data center operations to cloud service providers know, costs are always much higher than you expect. You pay for systems while they are idle, you pay for systems while they spin up and spin down, and you pay for systems across production, QA, and development platforms. The worst part of the experience here, for me, was the waiting. I am sure there is a market for these services, but that market is not me.
The real decision, then, came down to a powerful Linux server versus Apple Silicon. The advantages of the Linux server were scalability and capability across running, training, and fine-tuning models, along with screaming speed. The advantages of Apple Silicon were cost and simplicity.
In the end, Apple Silicon won out for the most banal of reasons: power and heat. The server I'd have been most likely to purchase would have cooked the artwork on my walls, driven my air conditioning bill into the stratosphere, and routinely fried my house's aging circuits. The first time I'd have lost power to the oven and ruined a great loaf of bread because a box in the corner of my office was summarizing an exceptionally long Substack newsletter, I'd have regretted every dollar I'd spent. Sometimes technology decisions boil down to something prosaic. In this case, the bread won.
Summary and Next Steps
All of this is great material for a long story at a cocktail party. Outside of the lessons about requirements, design, and technical deployment, however, how is it relevant to the introduction of AI tools into the enterprise?
I started this post with three objectives:
To illustrate the maturity of the LLM (Generative AI) space and the relative simplicity of integrating artificial intelligence components into a common workflow.
To highlight the factors that need to be weighed in selecting what model to introduce and how to introduce it into a given problem space.
To present and defend the lesson that traditional issues in system design—from requirements gathering to interfaces, integrations, database management, testing, and deployment—continue to represent the larger share of influence on a project's success, and that for any interesting project, the "AI" component is only one part of a larger whole.
Although it may not be obvious at first glance, my e-mail problem is your chatbot problem: we both have a problem with signal and noise.
Moreover, our problems share these truths:
It is relatively easy to conceptualize the value that GenAI would bring to help separate signal from noise.
It becomes a little more complicated once you have to start making decisions based on the strengths or weaknesses of specific models and how the introduction of GenAI will change both your employees and their environment.
Understanding your actual requirements, building out your associated systems, and safely introducing an essentially stochastic element into your hopefully largely deterministic processes will be more important to your success than anything else.
Introducing AI solutions like Retrieval-Augmented Generation (RAG) may seem to be the obvious way to pull actionable insights from an incomprehensibly large set of company documents and data. But using AI to mine, summarize, and generate responses from volumes of source material might make your problem better, or it might make it worse.
Why that is and what to do about it are two key questions we will explore in Part II.