<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Classification on brege.org</title>
    <link>https://brege.org/tags/classification/</link>
    <description>Recent content in Classification on brege.org</description>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>Copyright (c) 2016-2026 Wyatt Brege</copyright>
    <lastBuildDate>Sun, 12 Apr 2026 21:45:09 -0400</lastBuildDate>
    <atom:link href="https://brege.org/tags/classification/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>20 years of email</title>
      <link>https://brege.org/post/email-analysis/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      <guid>https://brege.org/post/email-analysis/</guid>
      <description>Making my case that satisfaction surveys are the new email cancer.</description>
      <content:encoded><![CDATA[<p>I&rsquo;m not intentionally a data hoarder. I just haven&rsquo;t been an effective or aggressive email deleter or filter user. This has changed some in recent years, as spam techniques have evolved to covertly trojan-horse &ldquo;survey&rdquo; subterfuge into my mailbox.</p>
<h2 id="i-have-survey-fatigue">I have survey fatigue</h2>
<p>Surveys are marketing emails. I can&rsquo;t believe I used to take the time to respond to some of them. An analysis of my email history shows that my hunch about survey spam was correct. Around 2023, I began marking all surveys as spam, and I&rsquo;ve got the data to prove just how rampantly companies have used surveys to get their brand into your inbox.</p>
<h2 id="emails-that-matter">Emails that matter</h2>
<p>My main goal in exploring my emails in-depth was to build a predictor of whether an email had any <strong>future usefulness</strong>.</p>
<p>Mail from friends and family, correspondence with students, receipts and financial records, etc. fall on the <strong>keep side of the binary</strong>. After I manually processed this massive backlog of email (with great help from Thunderbird&rsquo;s filters), I found that what I discarded most, besides spam, were surveys, newsletters, and other mass mailers.</p>
<p>What I found, qualitatively, was:</p>
<ul>
<li>imperfect spelling, capitalization, and grammar</li>
<li>little-to-no HTML markup</li>
<li>addressed only to me (once phishing and spam are excluded)</li>
</ul>
<p>These traits defined true keepsakes.</p>
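<p>As a rough sketch, the three traits above could be turned into per-message features. This is not how sanoma does it; the function name <code>keep_traits</code>, the feature names, and the proxies chosen for each trait are all assumptions made for illustration.</p>

```python
import re
from email import message_from_string
from email.utils import getaddresses


def keep_traits(raw_message: str) -> dict:
    """Score one raw RFC 822 message against the three traits listed above.

    Hypothetical helper: every proxy below is an illustrative assumption.
    """
    msg = message_from_string(raw_message)

    # Trait: little-to-no HTML markup. Proxy: the message has no text/html part.
    has_html = any(part.get_content_type() == "text/html" for part in msg.walk())

    # Trait: meant only for me. Proxy: exactly one address across To and Cc.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    single_recipient = len(recipients) == 1

    # Trait: imperfect capitalization. Proxy: share of sentences that start
    # with a lowercase letter (payloads are read undecoded, for simplicity).
    plain_parts = [
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain"
    ]
    text = " ".join(p for p in plain_parts if isinstance(p, str))
    sentences = [s.strip() for s in re.split(r"[.!?]\s+", text) if s.strip()]
    lowercase_starts = sum(1 for s in sentences if s[0].islower())
    casual_ratio = lowercase_starts / len(sentences) if sentences else 0.0

    return {
        "has_html": has_html,
        "single_recipient": single_recipient,
        "casual_ratio": casual_ratio,
    }
```

<p>Features like these would only be inputs to a keep/discard predictor; a spelling or grammar score would need a dictionary or language model and is left out here.</p>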
<h2 id="introducing-sanoma">Introducing sanoma</h2>
<p><a href="https://en.wiktionary.org/wiki/sanoma">en.wiktionary.org/wiki/sanoma</a></p>
<pre><code>sanoma (noun) Finnish 
  message, communication (a communication or the content of a physical
  message; also the message contained in some act or expression such as a
  work of art)
</code></pre>
<p><strong>sanoma</strong> (<a href="https://github.com/brege/sanoma">github.com/brege/sanoma</a>) uses YAML workflows to define multi-step analysis pipelines. The workflow runner automatically discovers and executes tools from the <code>sanoma/analysis/</code> and <code>sanoma/plot/</code> directories, making it easy to chain data extraction, filtering, analysis, and visualization into reproducible pipelines.</p>
<p>I developed this YAML workflow method in my Markdown-to-PDF project&ndash;<strong><a href="https://github.com/brege/oshea">oshea</a></strong>&ndash;where I realized comprehensive end-to-end tests were just manifest workflows. It&rsquo;s an intuitive way to string command line sequences together. The term <em>pipeline</em> in machine learning and data science is congruent with this system.</p>
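<p>To make the idea concrete, here is a minimal sketch of a step-by-step workflow runner. It is not sanoma&rsquo;s code: the <code>TOOLS</code> registry stands in for directory-based tool discovery, the step layout (<code>tool</code> plus <code>args</code>) is an assumed shape for a parsed YAML file, and the three tools are toy examples.</p>

```python
from typing import Any, Callable

# Hypothetical registry standing in for directory-based tool discovery.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under the step name a workflow refers to."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("extract")
def extract(_data, subjects):
    # Toy extraction step: start the pipeline from a list of subject lines.
    return list(subjects)


@tool("filter")
def keep_matching(data, keyword):
    # Keep only the items containing the keyword, case-insensitively.
    return [item for item in data if keyword in item.lower()]


@tool("count")
def count(data):
    return len(data)


def run_workflow(steps):
    """Run steps in order, feeding each step's output into the next one."""
    data = None
    for step in steps:
        func = TOOLS[step["tool"]]
        data = func(data, **step.get("args", {}))
    return data


# Assumed shape of a workflow after YAML parsing: an ordered list of steps.
workflow = [
    {"tool": "extract", "args": {"subjects": ["Your survey awaits", "Lunch?", "Quick survey"]}},
    {"tool": "filter", "args": {"keyword": "survey"}},
    {"tool": "count"},
]
print(run_workflow(workflow))  # prints 2
```

<p>Keeping the workflow as plain data is what makes the same runner usable for tests and analysis alike: only the list of steps changes.</p>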
<h2 id="data-mining">Data Mining</h2>
<p>While much of this can be done in a Jupyter notebook (far easier to refresh plots this way, although <code>:MarkdownPreview</code> in <strong>Neovim</strong> is sufficient), I built this project as a way to data-mine my own activity. I also want to create a visualization harness for many things on my computer:</p>
<ul>
<li>text message history</li>
<li>email history</li>
<li>screenshot frequency</li>
<li>browser history and bookmarks</li>
</ul>
<p>Because email is text-based, and because my first concept of &ldquo;AI&rdquo; was the combative spam filters that have been built over the last thirty years, email felt like a good starting point.</p>
<h2 id="grad-school-emails">Grad-school Emails</h2>
<p>The monthly timeline reveals the academic year rhythm: high volume during active semesters with dramatic drops during summer breaks and winter holidays. The 2016-2017 dip corresponds to my dissertation defense period, when militant email sanitation was a reprieve from LaTeX and simulation monitoring.</p>
<p>My personal dataset has about 35K emails across my grad-school account and <a href="https://brege.org">my current website&rsquo;s</a> personal email. Not included are my Gmail and undergrad email(s). I plan on synchronizing those at a later date.</p>
<h3 id="grad-school-timeline-seasonality">Grad-school Timeline Seasonality</h3>
<p><img alt="Grad-school Emails (monthly)" loading="lazy" src="/post/email-analysis/img/wsu/timeline.png"></p>
<p>WSU&rsquo;s Okta system required changing passwords every 6 months, and some time after my defense my account died. I am thankful that I had a Thunderbird profile tucked away on a drive that allowed me to recover all of my university emails.</p>
<h3 id="grad-school-and-onward-histogram">Grad-school and onward Histogram</h3>
<p><img alt="Grad-school Emails (yearly)" loading="lazy" src="/post/email-analysis/img/wsu/histogram.png"></p>
<p>The year-over-year histogram demonstrates consistent academic seasonality, with September-April peaks and May to mid-August valleys across all years of graduate study. Even with teaching summer labs, the bureaucratic pressure dies down in the summertime. I loved teaching in the summer.</p>
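<p>The counting behind a monthly timeline like the one above can be sketched in a few lines. This is an illustration rather than sanoma&rsquo;s implementation; the function name <code>monthly_counts</code> is hypothetical, and the input is assumed to be the raw <code>Date</code> header of each message.</p>

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def monthly_counts(date_headers):
    """Count messages per (year, month) from RFC 2822 Date header strings."""
    counts = Counter()
    for header in date_headers:
        try:
            sent = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            continue  # skip missing or malformed dates
        counts[(sent.year, sent.month)] += 1
    return counts


# Made-up sample headers, only to show the call.
sample = [
    "Mon, 01 Sep 2025 00:00:00 +0000",
    "Tue, 02 Sep 2025 10:00:00 +0000",
    "Wed, 15 Jul 2015 09:30:00 -0700",
    "not a date",
]
for (year, month), total in sorted(monthly_counts(sample).items()):
    print(year, month, total)
```

<p>Summing the same counts over the month alone, ignoring the year, gives the seasonality view. For a Thunderbird profile, the headers could plausibly be read by iterating over the folder files with Python&rsquo;s <code>mailbox.mbox</code>.</p>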
<h2 id="spam-marketing-and-surveys">Spam, Marketing, and Surveys</h2>
<p>The spam timeline shows minimal marketing emails pre-2010, followed by a sharp increase around university enrollment. By 2015, spam reached 60-80% of all emails and has remained consistently high. When the GDPR took effect in 2018, it created a spike in <code>unsubscribe</code> language as companies scrambled to comply with the new regulations.</p>
<h3 id="marketing-spam-trends">Marketing Spam Trends</h3>
<p><img alt="Spam Timeline" loading="lazy" src="/post/email-analysis/img/spam/timeline.png"></p>
<p>The tail at the beginning of this timeline is presented for context.
It only includes a &ldquo;purified&rdquo; Hotmail mailbox from my teenage years that extended a bit into my undergrad years. Those years overlap with Gmail usage (not integrated into this data) and my GVSU university email.</p>
<h3 id="keyword-buckets">Keyword Buckets</h3>
<p><img alt="Spam Keywords" loading="lazy" src="/post/email-analysis/img/spam/keywords.png"></p>
<p>Another useful filter for spam emails is checking for keywords like <strong><code>unsubscribe</code></strong> in the message body.</p>
<p><code>unsubscribe_bait</code> dominates with over 12,500 matches, followed by <code>satisfaction</code> surveys (~8k) and direct &ldquo;survey&rdquo; requests (~4k). This reveals how modern marketing shifted from direct sales to engagement-focused tactics requesting feedback and reviews.</p>
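<p>A keyword-bucket count of this kind might look like the sketch below. The bucket names are taken from the plot, but the regular expressions are illustrative guesses, not the patterns sanoma actually uses, and <code>bucket_counts</code> is a hypothetical name.</p>

```python
import re
from collections import Counter

# Bucket names come from the plot; the patterns are illustrative guesses.
BUCKETS = {
    "unsubscribe_bait": re.compile(r"\bunsubscribe\b", re.IGNORECASE),
    "satisfaction": re.compile(r"\bsatisf(?:ied|action)\b", re.IGNORECASE),
    "survey": re.compile(r"\bsurveys?\b", re.IGNORECASE),
}


def bucket_counts(bodies):
    """Count how many message bodies match each keyword bucket at least once."""
    counts = Counter()
    for body in bodies:
        for name, pattern in BUCKETS.items():
            if pattern.search(body):
                counts[name] += 1
    return counts


# Made-up message bodies, only to show the call.
bodies = [
    "Click here to unsubscribe.",
    "How satisfied were you? Take our survey. Unsubscribe anytime.",
    "Lunch at noon?",
]
print(bucket_counts(bodies))
```

<p>Counting each message once per bucket, rather than every occurrence, keeps a single footer-heavy newsletter from inflating a bucket. Grouping the same counts by year would give the rows of a heatmap.</p>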
<h3 id="conclusion-satisfaction-surveys-are-the-new-email-cancer">Conclusion: Satisfaction Surveys are the new email cancer</h3>
<p><img alt="Spam Heatmap" loading="lazy" src="/post/email-analysis/img/spam/heatmap.png"></p>
<p>The heatmap (filtered to post-2010) shows &ldquo;satisfaction&rdquo; spam as the most persistent threat, maintaining 20-25% frequency from 2012 onwards. Survey-based spam shows steady growth, intensifying after 2020, when GDPR constraints pressured companies to invent new angles of attack and the pandemic left them increasingly desperate for customer &ldquo;feedback&rdquo; (attention). <strong>Satisfaction feedback surveys are advertisements.</strong></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
