Data-Science on brege.org

Exploring my camera, screenshot, and image activity

Wed, 11 Feb 2026 16:09:29 -0500

This project is part of a series of data exploration projects around my personal computer usage.

GitHub link: github.com/brege/image-activity

This is related to my email classification project: github.com/brege/sanoma (work in progress).

Overview

Generate heatmaps and histograms of image saving activity over hours, days, and months
Use file timestamps, modified-times, EXIF, and regex parsing for refined image discovery
Add bands and markers for major life events
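
As a rough illustration of the regex and modified-time discovery step, here is a minimal Python sketch. The filename pattern and the helper names (`timestamp_from_name`, `best_timestamp`, `weekday_hour_counts`) are assumptions made for this example, not code from the repository, and EXIF reading is left out because it needs a third-party library.

```python
import re
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Optional

# Hypothetical pattern: matches names like "Screenshot_20230914-153012.png"
# or "IMG_20190704_181500.jpg". Real devices use many other conventions.
STAMP = re.compile(r"(\d{4})(\d{2})(\d{2})[-_](\d{2})(\d{2})(\d{2})")


def timestamp_from_name(name: str) -> Optional[datetime]:
    """Return the datetime embedded in a filename, or None if absent or invalid."""
    match = STAMP.search(name)
    if match is None:
        return None
    try:
        return datetime(*(int(part) for part in match.groups()))
    except ValueError:  # digits that only looked like a date, e.g. month 13
        return None


def best_timestamp(path: Path) -> datetime:
    """Prefer the filename stamp; fall back to the file's modified time."""
    parsed = timestamp_from_name(path.name)
    if parsed is not None:
        return parsed
    return datetime.fromtimestamp(path.stat().st_mtime)


def weekday_hour_counts(stamps) -> Counter:
    """Count timestamps per (weekday, hour) cell; Monday is weekday 0."""
    return Counter((ts.weekday(), ts.hour) for ts in stamps)
```

The `(weekday, hour)` counter is the kind of table a heatmap can be drawn from; summing it over one axis gives the matching histogram.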

Background

I wanted to determine if my image activity is dependent on major events and device purchases in my life.

do I tend to take more pictures during certain times of year?
how has my screenshot usage evolved over the last 15 years?
do I have “honeymoon” periods after a device purchase?
in what ways has my camera and screenshot usage changed between being an academic, chef, and developer?

I’m not a social media person, although my Mastodon account did see an uptick in usage following my hip surgery, when I began hiking and foraging a lot.

My image activity fits in three main categories:

camera: storage of camera photos from my phone
screenshots: screenshots on both my laptop and phone
internet: pictures downloaded from the internet

Gallery

In the first two line charts, Camera Usage and Image Capture Concurrency, I’ve marked the times when I purchased a major device (a new phone or laptop) and a couple of key periods of my life. These plots have all been normalized to a 0-100 photo count scale.
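
One plausible reading of the 0-100 normalization is a simple rescale by the peak count. The sketch below assumes exactly that; the function name `normalize_to_100` is made up for the example, and the project may normalize differently.

```python
def normalize_to_100(counts: list) -> list:
    """Rescale counts so the largest value maps to 100 (assumed peak scaling)."""
    peak = max(counts, default=0)
    if peak == 0:
        # Nothing to scale: an empty or all-zero series stays at zero.
        return [0.0 for _ in counts]
    return [100.0 * count / peak for count in counts]
```

Scaling each series by its own peak makes the shapes of the curves comparable, at the cost of hiding the absolute difference in volume between sources.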

Camera Usage

From 2010 to 2017 I was a Physics TA and, following my 2014 physics prelims, a computational astrophysics doctoral researcher. I began attending conferences in 2015, exploring places around Pullman, WA during these researcher years there.

At the end of 2017, I left that life. I embraced my love of food and cooking and became a professional chef for a number of years thereafter, including the Covid-19 pandemic. This period of my life saw a greater number of photos taken: pictures of plates, menus, schedules, etc. My camera photos before this time were mostly non-work-related: travel, events, and pets drove image origination.

Image Capture Concurrency

Heatmaps

I only have one experience with online coursework: the data science bootcamp I attended in the fall of 2023. This period did not have a major impact on my screenshotting habits. There are three principal areas in which screenshot usage was more frequent:

The creation of my website brege.org around August 2016.
As an executive chef, screenshotting is recurrent for scheduling, text message records, receipts/purchase dates, etc.
Agentic-driven coding workflows, beginning midway through 2025, saw a surge in screenshot usage. Screenshots have become a large part of my front-end debugging workflow for web app development–extending well beyond data-structured Cypress end-to-end tests.

I did not find my screenshot usage noticeably change during my brief stint with online coursework.

In general, it appears that I take more screenshots on desktop earlier in the week and in the afternoon (averaged over the last ~15 years). To my surprise, the heatmap for screenshots on my phone has nearly identical densities. I had assumed it would be biased toward the weekend and closer to 17:00 because of sports and restaurant dinner service.

Camera usage, on the other hand, differs by day of week only in its higher density on Thursday evenings and Saturday afternoons. This pattern is especially pronounced in both my chef days and my post-op mobility period.

Histograms

By device and source, then binned on hours of the day, day of the week, and month of the year, histograms provide a finer distribution in one dimension.

For the hourly concentration of all three photo habits, my activity roughly follows a Boltzmann distribution.

These distributions generally peak at two distinct hours:

camera photos and screenshots center around 15:00
internet photos are generally concentrated around 20:00

Each bin is averaged for each picture type over the last 15 years, regardless of timezone.

Image activity generally increases at the beginning and end of standard university semesters, which also include the height of summer and the holiday period when I am always travelling. Screenshotting is highest in the fall to mid-winter.

In my experience, restaurants are historically busier between, roughly, Friendsgiving and Father’s Day. Camera usage is also largest during high summer. Beach. Hiking. Produce selection during chef years.

20 years of email

Mon, 01 Sep 2025 00:00:00 +0000

I’m not intentionally a data hoarder. I just haven’t been an effective or aggressive email deleter or filter user. This has changed some in recent years, as the techniques for spam emails have evolved to covertly trojan “survey” subterfuge into my mailbox.

I have survey fatigue

Surveys are marketing emails. I can’t believe I used to take the time to respond to some of them. An analysis of my email history has shown that my hunch on survey spam is correct. Around 2023, I began marking all surveys as spam, and I’ve got the data to prove just how rampantly companies have used surveys to get their brand in your inbox.

Emails that matter

My main goal in exploring my emails in-depth was to build a predictor of whether an email had any future usefulness.

Mail from friends and family, correspondence with students, receipts and financial records, etc. fit the binary of keep. After I manually processed this massive backlog of email (with great help of Thunderbird’s filters), what I found I discarded most, besides spam, were surveys, newsletters, and other mass mailers.

What I found, qualitatively, was:

imperfect spelling, capitalization, and grammar
little-to-no HTML markup
all emails meant only for me sans phishing and spam

These traits defined true keepsakes.

Introducing sanoma

en.wiktionary.org/wiki/sanoma

sanoma (noun) Finnish 
  message, communication (a communication or the content of a physical
  message; also the message contained in some act or expression such as a
  work of art)

sanoma (github.com/brege/sanoma) uses YAML workflows to define multi-step analysis pipelines. The workflow runner automatically discovers and executes tools from the sanoma/analysis/ and sanoma/plot/ directories, making it easy to chain data extraction, filtering, analysis, and visualization into reproducible pipelines.

I developed this YAML workflow method in my Markdown-to-PDF project–oshea–where I realized comprehensive end-to-end tests were just manifest workflows. It’s an intuitive way to string command line sequences together. The pipeline term in machine learning/data science is congruent to this system.

Data Mining

While much of this can be done in a Jupyter notebook (far easier to refresh plots this way, although :MarkdownPreview in Neovim is sufficient), I built this project as a way to data-mine my own activity. I also want to create a visualization harness for many things on my computer:

text message history
email history
screenshot frequency
browser history and bookmarks

Because email is text-based, and because my first concept of “AI” was the need for combative spam filters that have been built over the last thirty years, email felt like a good starting point.

Grad-school Emails

The monthly timeline reveals the academic year rhythm: high volume during active semesters with dramatic drops during summer breaks and winter holidays. The 2016-2017 dip corresponds to the dissertation defense period, when militant email sanitation was a reprieve from LaTeX and simulation monitoring.

My personal dataset has about 35K emails between my grad-school emails and my current website’s personal email. Not included are my Gmail and undergrad email(s). I plan on synchronizing those at a later date.

Grad-school Timeline Seasonality

WSU’s Okta system required changing passwords every 6 months, and some time after my defense my account died. I am thankful that I had a Thunderbird profile tucked away on a drive that allowed me to recover all of my university emails.

Grad-school and onward Histogram

The year-over-year histogram demonstrates consistent academic seasonality, with September-April peaks and May – mid-August valleys across all years of graduate study. Even with teaching summer labs, the bureaucratic pressure in the summertime dies. I loved teaching in the summer.

Spam, Marketing, and Surveys

The spam timeline shows minimal marketing emails pre-2010, followed by a sharp increase around university enrollment. By 2015, spam reached 60-80% of all emails and has remained consistently high. The GDPR implementation around 2018 created a spike in unsubscribe language as companies scrambled to comply with new regulations.

Marketing Spam Trends

The tail in the beginning of this timeline is presented for context. It only includes a “purified” hotmail account mailbox from my teenage years that extended a bit into my undergrad years. Those years overlap with Gmail usage (not integrated into this data) and my GVSU university email.

Keyword Buckets

Another useful filter for spam emails is checking for keywords like unsubscribe in the message body.

unsubscribe_bait dominates with over 12,500 matches, followed by satisfaction surveys (~8k) and direct “survey” requests (~4k). This reveals how modern marketing shifted from direct sales to engagement-focused tactics requesting feedback and reviews.
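
A keyword bucket count of this kind can be sketched in a few lines of Python. Only the bucket name `unsubscribe_bait` comes from the text above; the regular expressions, the other bucket keys, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical bucket patterns; the real project's rules are not shown here.
BUCKETS = {
    "unsubscribe_bait": re.compile(r"\bunsubscribe\b", re.IGNORECASE),
    "satisfaction": re.compile(r"\bsatisf(?:ied|action)\b", re.IGNORECASE),
    "survey": re.compile(r"\bsurveys?\b", re.IGNORECASE),
}


def bucket_matches(body: str) -> set:
    """Return the names of every bucket whose pattern appears in the body."""
    return {name for name, pattern in BUCKETS.items() if pattern.search(body)}


def count_buckets(bodies) -> Counter:
    """Count, per bucket, how many message bodies matched at least once."""
    totals = Counter()
    for body in bodies:
        totals.update(bucket_matches(body))
    return totals
```

Counting each message at most once per bucket keeps a single email that repeats "unsubscribe" five times from inflating the totals.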

Conclusion: Satisfaction Surveys are the new email cancer

The heatmap (filtered to post-2010) shows “satisfaction” spam as the most persistent threat, maintaining 20-25% frequency from 2012 onwards. Survey-based spam shows steady growth, intensifying after 2020, when GDPR constraints pressured companies to invent new angles of attack and they became increasingly desperate for customer “feedback” (attention) during the pandemic. Satisfaction feedback surveys are advertisements.

The Flavor Network

Wed, 04 Jan 2023 04:04:49 -0500

This tool allows you to explore the flavor network, a social graph for flavor profiles. The network is based on the Flavor Bible and soon the companion book What to Drink with What You Eat.

Search for an ingredient you like, and the graph will refine to give you a web of ingredients that share highly similar flavor profiles. Then, click on a new ingredient in the network to add it to your recipe above the search box (or to remove it). Clicking on a recipe item or a node has the same effect. Search results are not sorted by the flavor metric; they are sorted lexically.

In this way, you can start building out recipes, menu items and tastings from a consensus of flavor combinations.

Overview

What you are seeing:

the nodes with color are your recipe ingredients
the suggested ingredients are determined by Jaccard similarity (default) or by one of the other options in the ‘lens’ dropdown
if you choose the hybrid option, the suggested ingredients are fiducially split between:
- the most similar ingredients in the flavor metric (similarity)
- the most similar ingredients by text ranking (affinity)
the edges from one ingredient to another are weighted by a consensus of chef and expert opinion ¹
if your ingredient is missing, it was likely missing in the book (quinoa) or was pruned because its mentions were too sparse ²

If a node is present without an edge, it means that the ingredient has a very good similarity with your recipe, but wasn’t mentioned (connected) in its book-entry literally. Reconstructing ‘ghost’ entries and connections by training a model with listed affinities is one of the ultimate goals of this project.

The number of suggestions gradually decreases as you add more ingredients to your recipe. This is for performance reasons, as is the physics simulation disabling itself when it destabilizes. When that happens, maybe you discovered a flavor affinity.

I have included autogenerated links from your recipe basket to a few popular recipe resources above the network graph, including a database for cocktail mixing. ³

Why and how

Why I chose this text for the dataset is probably already apparent to its readers, but the key takeaway is that the authors did a fine job formatting something both computer readable and human usable, a rare feat! Most importantly, it is aggregated from chefs, from real humans in kitchens doing what works, what’s delicious, and what’s in season. Recipe APIs don’t have this kind of granularity; many rely too heavily on user data to seed recommendations. To my knowledge, this is the only dataset of this kind.

The only technical tools involved are vis.js for visualization and BeautifulSoup for parsing. The data is scraped from the Flavor Bible, and the similarity matrix is calculated using Jaccard similarity for pairwise comparisons. I am working on cleaning up the initial data with some concoction of nltk, fuzzywuzzy, and/or BERT. The current form was done entirely with regexp/bs4 parsing. The suggested nodes can be improved by using a weighted Jaccard probability distribution (arXiv). Source code for this part of the calculation (the text → dataset chain) is available on GitHub.
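
As a rough illustration of the pairwise Jaccard calculation described above, here is a minimal Python sketch. The ingredient sets and the helper names `jaccard` and `most_similar` are made up for the example and are not taken from the project's source.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(a & b) / len(union)


def most_similar(target: str, pairings: dict, top_n: int = 3) -> list:
    """Rank every other ingredient by Jaccard similarity to the target."""
    scores = [
        (jaccard(pairings[target], others), name)
        for name, others in pairings.items()
        if name != target
    ]
    # Highest score first; ties are broken alphabetically by name.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scores[:top_n]]


# Toy data, invented for this sketch: each ingredient maps to the set of
# ingredients listed as pairing with it.
PAIRINGS = {
    "basil": {"garlic", "tomato", "olive oil", "lemon"},
    "oregano": {"garlic", "tomato", "olive oil", "lamb"},
    "mint": {"lemon", "lamb", "peas"},
}
```

Calling `most_similar("basil", PAIRINGS)` ranks oregano ahead of mint, because basil and oregano share three of their five combined pairings while basil and mint share only one of six.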

Inspiration

In 2019, I was helping fellow chefs come up with new specials. At this point in time, we were rolling out about four to ten new specials as a team every week, ranging from brunch, cocktails, lunch, and football apps to our highly anticipated farm-to-fork pop-up dinners. But sometimes you just get plain stuck. A good trick, at least for creativity, is to set rules so you have some boundaries to push. But if you are going to set rules, they should at least solve a few things:

do something new
use something old
feature three things in season

I hate having extra stuff around, but I love new stuff coming in, yet I don’t like wasting things, but then I actually look forward to doing inventory. Ah, Schrödinger’s cook.

Specials: we would work out new ideas together over the prep table. Sometimes ideas required working things out on paper, usually butcher’s, and occasionally crude graphs of our plate setups evolved. These were sketches of sauce and protein layouts, with heavy edges between ideas if their pairing ‘sang’, which then hung from the ticket rail as a guide on the night a feature debuted.

Karen and Andrew’s book was gifted to me later that year, and it changed my game. It finally put in words a mental ranking of flavor profiles based on ingredient query. I had a good resource that gave answers, and especially new ideas, quickly. And it was thorough enough to trust.

This method was so helpful, I started dreaming of a computer tool to help me sketch out this process. I remembered a reddit post that spurred others to lay out some of the underlying ideas here: overlapping communities :: compatible flavors.

Broader thoughts

I believe the impact of mathematical concepts to the broader culinary scope to be a major upgrade in our thoughtfulness about food. To extend its application, in creativity and clarity, not abused in statistics to pressure a sale and disable the creative mind. While I do see how a tool like this could provide immense practical application in the distribution world, my focus here is to empower chefs, bartenders, brewers, baristas, and sommeliers to create new things.

When it comes to tools available to chefs, compared to musicians, writers, and artists, chefs are unfortunately at a disadvantage creatively. Yes, we have recipes, but those are instructions, and do little to help us build on ratio or balance. What might be more helpful, I think, is a playground for putting new food ideas together.

In the book, the weight of the pairing is given by the emphasis of the text:
- normal text means mentioned by at least one expert
- bold is recommended by many experts
- BOLD CAPS is highly recommended
- *BOLD CAPS is the “Holy Grail” of pairings
If the ingredient is not mentioned, it is given no weight (or edge), but that does not mean a flavor pairing doesn’t exist. This is part of the purpose of this tool! Lastly, there are a few dozen mentions of “Avoid”, which should be thought of as opposite charges. ↩︎
If you encounter a bug, please feel free to contact me by email or open an issue on GitHub! ↩︎
When What to Drink has been parsed and merged with the network, the latter link in the recipe site list should become much more robust. How fun! ↩︎

Les Miserables

Sat, 24 Dec 2022 05:30:47 -0500

Les Miserables is one of my favorite books. I read most of the original translation on a train ride to Portland, OR from Chicago, IL back in 2008 and enjoyed the remainder on the return trip back East. It taught me compassion: when Valjean places the coin in Cosette’s shoe. Father Christmas always misses her. There was an earlier passage of a man stepping on a coin in front of her, while she swept dressed in rags.

The graph may take a moment to load.

The search bar is the major addition to the graphing methods. Nodes can be clicked and added to a subgraph builder. You can continue to search for new node members in the search bar (which has a rudimentary autofill that’s a straight json query) and clicking on them will add them to the builder. Simultaneously, the graph will reduce to a graph containing only all nodes with edges linked to nodes in the builder.
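
The reduction step can be sketched as a neighborhood filter over an edge list. This is a Python illustration only, not the post's JavaScript: the function name `reduce_graph`, the sample edges, and the choice to keep only edges that touch a builder node are all assumptions.

```python
def reduce_graph(edges: list, builder: list) -> tuple:
    """Keep builder nodes plus every node sharing an edge with one of them.

    With an empty builder the whole graph is returned, mirroring the
    behavior described for clearing the builder bar.
    """
    selected = set(builder)
    if not selected:
        all_nodes = {node for edge in edges for node in edge}
        return all_nodes, list(edges)
    # Assumed rule: an edge survives if at least one endpoint is in the builder.
    kept_edges = [(a, b) for a, b in edges if a in selected or b in selected]
    nodes = selected | {node for edge in kept_edges for node in edge}
    return nodes, kept_edges


# Illustrative edges only; not taken from the actual dataset.
EDGES = [
    ("Valjean", "Cosette"),
    ("Valjean", "Javert"),
    ("Cosette", "Marius"),
    ("Marius", "Eponine"),
]
```

With `["Valjean"]` in the builder, the sketch keeps Valjean, Cosette, and Javert and drops the Marius edges, since neither of their endpoints is a builder node.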

Items can be removed from the builder either by clicking the little builder tabs or re-clicking the node. Clearing the builder bar completely will redraw the whole graph.

Testing and development were done on the mini pesto data set I made for What is Pesto?. Recipe builder coming soon(!)

Please email me at wyatt@brege.org with any questions.

Dataset can be found here:

Lingering annoyances:

Slow

Javascript needs clean up

I have great fear running this on my 700x3000 dataset..

Network Graphs with Images

Wed, 21 Dec 2022 02:15:04 -0500

This is a follow-up to the previous post Network Graphs in Hugo. I’m feeling fruity. These aren’t all tree fruits, but a few clusters organized by tree grafting compatibility.

Data for the network is stored in two separate JSON files in this page bundle:
- nodes.json
- edges.json

The shortcode and post-local javascript work together:

fruit-network.html

fruit-network.js

{{ $nodesPath := .Get "nodesPath" }}
{{ $edgesPath := .Get "edgesPath" }}

<style>
  #mynetwork {
    background-color: #f5f5f5; /* a light gray color */
    border-radius: 10px;
    border: 1px solid #cccccc;
    margin: 5px 0 40px 0;
  }
</style>

<div id="mynetwork" data-nodes-path={{ $nodesPath }} data-edges-path={{ $edgesPath }}></div>

<script src="https://visjs.github.io/vis-network/standalone/umd/vis-network.min.js"></script>
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script src="fruit-network.js"></script>

This will provide network graph physics where the nodes are images (all sourced from Wikipedia). The Hugo template, for completeness:

{{< fruit-network nodesPath="nodes.json" edgesPath="edges.json" >}}

Network Graphs in Hugo

Fri, 09 Dec 2022 23:02:42 -0500

This is a simple toy to see how a network graph can be added in a Hugo article. I’ll be testing new features on it as I learn new things.

Relative to the root of the Hugo website directory, here are some basic files to make this interactive. Note that the JSON data and CSS are added inline here to keep the scope of this tutorial focused on Hugo-specific structures.

The javascript file lives in this page bundle:
- toy-network.js
This file accesses data for the nodes and edges from two JSON files in this page bundle:
- nodes.json
- edges.json

In the shortcodes directory /layouts/shortcodes/:

toy-network.html

<div id="mynetwork" data-nodes-path="nodes.json" data-edges-path="edges.json">
    <script src="https://visjs.github.io/vis-network/standalone/umd/vis-network.min.js"></script>
    <script src="toy-network.js"></script>
</div>

Create a post in Hugo the usual way, but invoke the shortcode within the body of your markdown:
- index.md
```
{{< toy-network nodesPath="nodes.json" edgesPath="edges.json" >}}
```

This will provide the simple network graph above.

Hockey Catch-all Statistics versus Salary Cap

Tue, 07 Nov 2017 11:11:52 -0800

This project ¹ is motivated by the “WAR” stat in baseball, where I have adopted the “Goals vs. Threshold” (GVT) statistic from Tom Awad. Here, I only consider the Offensive GVT for forward skaters and defensemen (OGVT).

I take as input the spreadsheet provided by Robert Vollman, which has not been updated with GVT data yet. I made minor modifications to his spreadsheet in LibreOffice Calc to make it export to the CSV file format well. The code calculates OGVT by player, which is weighted against his own team’s Threshold Offensive Contribution by forwards ($TOC_F$), or defensemen ($TOC_D$), per minute, rather than league wide.

To get an estimate of how good a goal is compared to an assist, we estimate that a goal scored contributes 1.5 times as much as an assist contributes to a goal. Therefore, the calculated goal value (or assist) scored by an entity $x$ is $$ \begin{aligned} GV_x &= \frac{1.5 G_x}{A_x + 1.5 G_x}, \\ AV_x &= \frac{GV_x}{1.5} \end{aligned} $$ where $G_x$ is goals scored by either an individual, $x=i$, team, $x=T$, or the league as a whole, $x=L$, and $A_x$ are the assists scored by those subcategories.

The total offensive contribution of all forwards, $TOC_F$, is determined by

$$ TOC_F = \frac{\sum_{f \in T} \left( G_f \times GV_T + A_f \times AV_T \right)}{\sum_{f \in T} MP_f} \times OTV$$ where $MP_f$ is the minutes played by forward $f$, and the offensive threshold value is $OTV = 0.75$ via Tom Awad or $0.58$ via Alan Ryder (I chose the former). I chose an uppercase $F$ so that one may distinguish this value, which applies to all forwards on the team, from an individual forward, $f$.

The final formula to calculate $OGVT$ for each forward $f$ is, according to Awad, then $$ OGVT = G_f \times GV_f + A_f \times AV_f - MP_f \times TOC_F $$
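
The formulas above translate fairly directly into code. The following is a minimal Python sketch, not the project's actual implementation: the `Skater` class and function names are invented for the example, and it assumes the team totals $G_T$ and $A_T$ are summed over only the forwards passed in.

```python
from dataclasses import dataclass

GOAL_TO_ASSIST = 1.5  # a goal counts 1.5 times an assist, as in the text
OTV = 0.75            # offensive threshold value attributed to Tom Awad


@dataclass
class Skater:
    goals: int
    assists: int
    minutes: float


def goal_value(goals: float, assists: float) -> float:
    """GV_x = 1.5 G_x / (A_x + 1.5 G_x); zero when there is no scoring."""
    denom = assists + GOAL_TO_ASSIST * goals
    return GOAL_TO_ASSIST * goals / denom if denom else 0.0


def assist_value(goals: float, assists: float) -> float:
    """AV_x = GV_x / 1.5."""
    return goal_value(goals, assists) / GOAL_TO_ASSIST


def team_toc(forwards: list) -> float:
    """Threshold offensive contribution per minute for a group of forwards."""
    # Assumption: team totals are taken over the listed forwards only.
    team_goals = sum(f.goals for f in forwards)
    team_assists = sum(f.assists for f in forwards)
    gv_t = goal_value(team_goals, team_assists)
    av_t = assist_value(team_goals, team_assists)
    production = sum(f.goals * gv_t + f.assists * av_t for f in forwards)
    minutes = sum(f.minutes for f in forwards)
    return production / minutes * OTV if minutes else 0.0


def ogvt(player: Skater, toc: float) -> float:
    """OGVT = G_f * GV_f + A_f * AV_f - MP_f * TOC_F."""
    gv = goal_value(player.goals, player.assists)
    av = assist_value(player.goals, player.assists)
    return player.goals * gv + player.assists * av - player.minutes * toc
```

A call such as `ogvt(skater, team_toc(forwards))` gives one player's value relative to his own team's threshold, which is the quantity plotted against cap hit.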

Additionally, I wanted to get a sense for one player’s value to the team in relation to his salary cap hit. Here, I show from the 2016-17 NHL regular season $OGVT$ versus Salary Cap for the Stanley Cup Champion Pittsburgh Penguins, the cap-troubled Detroit Red Wings, and the young Edmonton Oilers with generational talent Connor McDavid (only forward skaters).

However, in debugging my code, something seemed strange to me. This first term in the $OGVT$ expression, with some math, reduces to the number of goals by that individual: $$ \begin{aligned} G_f \times GV_f + A_f \times AV_f &= G_f \times GV_f + A_f \times \frac{GV_f}{1.5} \\ &= \left( G_f + \frac{A_f}{1.5}\right ) \times GV_f \\ &= \left( G_f + \frac{A_f}{1.5}\right ) \times \left( \frac{1.5 G_f}{A_f + 1.5 G_f} \right) \\ &= \left( 1.5 G_f + A_f \right) \times \left( \frac{G_f}{A_f + 1.5 G_f} \right) \\ &= G_f. \end{aligned} $$ So, unless I’m misunderstanding Tom Awad’s definition of terms here:

A player’s OGVT is therefore:

OGVT = (G x GV) + (A x AV) - (MP x TOC)

Where G is the player’s goals, A his assists, MP his minutes played, GV his goal value, AV his assist value, and TOC the Threshold offensive contribution value for his position.

I don’t quite understand how this first set of terms is relevant, as it essentially removes the direct value of a skater’s assists in the calculation of this catch-all offensive statistic.
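
The algebraic reduction can also be spot-checked numerically with exact fractions. This is a small Python sketch written for this purpose; the function names are made up, and goals start at 1 to avoid the zero denominator when a skater has no goals and no assists.

```python
from fractions import Fraction

WEIGHT = Fraction(3, 2)  # the 1.5 goal-to-assist weight, kept exact


def first_term(goals: int, assists: int) -> Fraction:
    """Evaluate G * GV + A * AV with GV and AV as defined in the text."""
    gv = WEIGHT * goals / (assists + WEIGHT * goals)
    av = gv / WEIGHT
    return goals * gv + assists * av


def identity_holds(max_goals: int = 60, max_assists: int = 80) -> bool:
    """Check that the first term equals the goal count over a grid of totals."""
    return all(
        first_term(goals, assists) == goals
        for goals in range(1, max_goals + 1)
        for assists in range(0, max_assists + 1)
    )
```

Using `Fraction` rather than floats means the equality test is exact, so a pass over the grid is not an artifact of rounding.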

Actually, I mostly wanted to get some experience with D3 and using publicly accessible data. I’m still investigating why the axis titles aren’t showing on my plot. ↩︎