<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.testingbranch.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.testingbranch.com/" rel="alternate" type="text/html" /><updated>2026-02-08T11:04:20+00:00</updated><id>https://www.testingbranch.com/feed.xml</id><title type="html">Testing Branch</title><subtitle>Explorations in machine learning, simulation, and data modeling — practical notebooks and experiments.</subtitle><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><entry><title type="html">Re-Identification vs Anonymization Strength</title><link href="https://www.testingbranch.com/re_identification/" rel="alternate" type="text/html" title="Re-Identification vs Anonymization Strength" /><published>2026-02-08T00:00:00+00:00</published><updated>2026-02-08T00:00:00+00:00</updated><id>https://www.testingbranch.com/re_identification</id><content type="html" xml:base="https://www.testingbranch.com/re_identification/"><![CDATA[<p>Code: <a href="https://github.com/mpcsb/reidentification">github.com/mpcsb/reidentification</a></p>

<h2 id="re-identification-risk-vs-k-anonymity-an-experimental-walkthrough">Re-Identification Risk vs k-Anonymity: An Experimental Walkthrough</h2>

<p>Most discussions of anonymization focus on buzzwords like <strong>k-anonymity</strong> and <strong>differential privacy</strong>, but few dig into what actually happens to a dataset as anonymity strength increases.</p>

<p>In this post, we conduct a full experimental walkthrough to quantify how raising the k-anonymity level impacts both <strong>privacy</strong> (re-identification risk) and <strong>data utility</strong>.</p>

<p>We simulate an attacker with partial knowledge trying to re-identify individuals, and we measure how data quality degrades as we ramp up the anonymization.<br />
The goal is to illuminate where the balance lies between keeping data useful and keeping individuals anonymous.</p>

<hr />

<h2 id="data-generation-and-anonymization-setup">Data Generation and Anonymization Setup</h2>

<p>For our experiment, we generated a synthetic dataset of <strong>2000 individuals</strong>, each with the following fields:</p>

<table>
  <thead>
    <tr>
      <th>Field</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">age</code></td>
      <td>Numerical age (used as a quasi-identifier)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">zip3</code></td>
      <td>3-digit ZIP code prefix (regional location, quasi-identifier)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">sex</code></td>
      <td>Binary sex attribute (quasi-identifier)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">lab_glucose</code></td>
      <td>Continuous lab glucose level (a target variable <em>not</em> used in anonymization)</td>
    </tr>
  </tbody>
</table>

<p>We treat <strong>age</strong>, <strong>zip3</strong>, and <strong>sex</strong> as the quasi-identifiers (QIs) that will be subject to anonymization.</p>

<p>The value of <strong>k</strong> in k-anonymity was varied from <strong>1 to 20</strong>.<br />
A k-anonymity requirement means each record must be indistinguishable from at least <strong>k–1 others</strong> with respect to these QIs.</p>

<p>To achieve this, an anonymization routine <strong>groups and generalizes</strong> records until every combination of QIs occurs in at least k records.<br />
In practical terms, as k increases, the algorithm must increasingly <strong>generalize (coarsen)</strong> or <strong>suppress</strong> details in the QIs to satisfy the larger group size.</p>
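<p>As a concrete illustration, a minimal generalization loop can widen age bins until every quasi-identifier combination reaches size k. This is a sketch, not the post’s exact routine: the bin-width ladder, the fallback ZIP suppression, and the toy data are illustrative assumptions.</p>

```python
import pandas as pd

def generalize(df, k, widths=(1, 2, 5, 10, 25)):
    """Widen age bins until every (age_band, zip3, sex) cell holds >= k records."""
    out = df.copy()
    for width in widths:
        out["age_band"] = (out["age"] // width) * width
        sizes = out.groupby(["age_band", "zip3", "sex"]).size()
        if sizes.min() >= k:
            return out, width
    out["zip3"] = "Other"  # last resort: suppress geography entirely
    return out, widths[-1]

# toy data: eight distinct QI combinations, 25 records each
df = pd.DataFrame({
    "age":  [23, 24, 25, 26, 31, 32, 33, 34] * 25,
    "zip3": (["101"] * 4 + ["102"] * 4) * 25,
    "sex":  ["F", "M"] * 100,
})
anon, width = generalize(df, k=30)
print(width)  # 10-year bins are needed before every cell reaches k=30
print(anon.groupby(["age_band", "zip3", "sex"]).size().min())  # 50
```

<p>At k = 30 the 1-, 2-, and 5-year bins all leave cells of size 25, so the loop is forced up to 10-year bins, which merge neighboring ages into cells of 50.</p>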

<hr />

<h3 id="parameters-explored">Parameters explored</h3>

<p>We explored a few key parameters that control how the data is generalized:</p>

<ul>
  <li><strong>Age bin width:</strong> We varied age grouping from 1-year bins (no grouping beyond integer ages) up to 10-year bins. Larger bin widths mean ages get lumped into broader ranges (e.g., 30–39).</li>
  <li><strong>Top-coding of age:</strong> Extreme ages were top-coded above a threshold (e.g., all ages 75 and above recorded as <code class="language-plaintext highlighter-rouge">"75+"</code>). This prevents very old ages from standing out.</li>
  <li><strong>Rare ZIP suppression:</strong> Low-frequency ZIP3 regions were grouped into an <code class="language-plaintext highlighter-rouge">"Other"</code> category once their count fell below a threshold. If a region is too unique, it gets collapsed to hide outliers.</li>
</ul>

<p>By adjusting these knobs, we impose different anonymization strategies.<br />
For each value of <strong>k</strong> (and each combination of binning/top-coding settings), we produced an anonymized version of the dataset and evaluated how much information was lost in the process.</p>

<h2 id="attacker-context-partial-knowledge-threat-model">Attacker Context: Partial Knowledge Threat Model</h2>

<p>Anonymization is only meaningful relative to an attacker’s knowledge.<br />
In our scenario, we simulate an attacker who has <strong>partial information</strong> about individuals — specifically, the attacker knows an individual’s:</p>

<ul>
  <li>age</li>
  <li>general location (ZIP3 region)</li>
  <li>sex</li>
</ul>

<p>(e.g., from a data leak or public records).</p>

<p>This is a common threat model for re-identification:<br />
an adversary might obtain someone’s demographic details from a breached source and then try to find that person in an anonymized dataset (such as medical or survey data) released publicly.</p>

<p>The attacker’s goal is <strong>re-identification</strong>:<br />
to match each anonymized record to the corresponding real individual by comparing the quasi-identifiers.</p>

<p>Importantly, the attacker does <strong>not</strong> know the sensitive value (<code class="language-plaintext highlighter-rouge">lab_glucose</code>) in our case;<br />
they only leverage the QIs that are also present (albeit generalized) in the anonymized data.</p>

<p>This kind of attack is known as a <strong>record linkage attack</strong>, using the assumption that if an anonymized entry shares a unique combination of age, region, and sex with a known individual’s data, they are likely the same person.</p>

<p>This threat model underscores why k-anonymity focuses on QIs:<br />
even innocuous-seeming attributes like age and ZIP code can triangulate someone’s identity when combined.</p>

<p>Next, we describe how our simulated attacker performs the re-identification.</p>

<hr />

<h2 id="attackers-re-identification-strategy-global-matching">Attacker’s Re-identification Strategy (Global Matching)</h2>

<p>How does our attacker try to re-identify records?</p>

<p>Instead of using a simple greedy matching (checking each anonymized record independently),<br />
we implement a <strong>global optimization strategy</strong>.</p>

<p>We use a <strong>bipartite assignment solver</strong> (Google OR-Tools’ linear sum assignment solver) to find the optimal one-to-one matching between anonymized records and original records that best aligns their attributes.</p>

<h3 id="cost-based-matching">Cost-based Matching</h3>

<p>We define a cost for matching an anonymized record <em>aᵢ</em> with an original record <em>oⱼ</em> based on their differences in quasi-identifiers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cost(i, j) = d_age(a_i, o_j) + d_zip(a_i, o_j) + d_sex(a_i, o_j)
</code></pre></div></div>

<p>Each <strong>d</strong> term is a distance measure for that attribute.</p>

<ul>
  <li>If an anonymized age is a range (due to binning) and the original age falls within that range, the age distance may be zero; if it falls outside, the distance increases.</li>
  <li>If an anonymized ZIP3 was generalized to <code class="language-plaintext highlighter-rouge">"Other"</code>, any specific ZIP from the original will incur a cost when compared to <code class="language-plaintext highlighter-rouge">"Other"</code>.</li>
</ul>

<p>These distances capture how well an original record fits the generalized form of an anonymized record.<br />
Lower cost → the two records are more similar across QIs.</p>

<p>We then solve for the assignment <strong>π(i)</strong> that minimizes the total cost of matching all anonymized records to distinct original records:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>min_π  Σ_i  cost(i, π(i))
subject to: each original record is matched at most once
</code></pre></div></div>

<p>This optimization finds the <strong>best overall matching</strong> between the two datasets.</p>

<p>By considering all records jointly, the attacker avoids making locally optimal but globally inconsistent matches.</p>

<p>Even if each anonymized record individually has multiple plausible matches, the solver finds a <strong>globally consistent</strong> assignment.<br />
The outcome is an assignment pairing most anonymized records with specific original record guesses.</p>
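<p>The post solves this with Google OR-Tools; the same global matching can be sketched with SciPy’s equivalent <code class="language-plaintext highlighter-rouge">linear_sum_assignment</code>. The per-attribute distances and weights below are illustrative assumptions, and the final line computes the attacker’s top-guess accuracy against a known ground truth (record i corresponds to record i).</p>

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def qi_cost(a, o):
    """Sum of per-attribute distances between an anonymized and an original record."""
    lo, hi = a["age_band"]  # anonymized age is a range
    d_age = 0 if lo <= o["age"] <= hi else min(abs(o["age"] - lo), abs(o["age"] - hi))
    if a["zip3"] == o["zip3"]:
        d_zip = 0
    elif a["zip3"] == "Other":  # generalized region: small but nonzero cost
        d_zip = 0.5
    else:
        d_zip = 2
    d_sex = 0 if a["sex"] == o["sex"] else 3
    return d_age + d_zip + d_sex

anon = [{"age_band": (30, 39), "zip3": "101",   "sex": "F"},
        {"age_band": (50, 59), "zip3": "102",   "sex": "M"},
        {"age_band": (40, 49), "zip3": "102",   "sex": "F"},
        {"age_band": (70, 79), "zip3": "Other", "sex": "M"}]
orig = [{"age": 34, "zip3": "101", "sex": "F"},
        {"age": 52, "zip3": "102", "sex": "M"},
        {"age": 49, "zip3": "102", "sex": "F"},
        {"age": 71, "zip3": "103", "sex": "M"}]

cost = np.array([[qi_cost(a, o) for o in orig] for a in anon])
row_ind, col_ind = linear_sum_assignment(cost)  # globally optimal one-to-one matching
hit1 = float(np.mean(col_ind == np.arange(len(anon))))  # truth: anon i <-> orig i
print(hit1)  # 1.0 on this tiny, unambiguous example
```

<p>On real data with thousands of records and overlapping QI cells, the diagonal is no longer uniquely cheapest and Hit@1 falls well below 1.</p>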

<hr />

<h3 id="measuring-re-identification-success-hit1">Measuring Re-identification Success: Hit@1</h3>

<p>To evaluate the attack, we use the <strong>Hit@k</strong> metric common in information retrieval.</p>

<p>A “hit” means the correct original record appears within the attacker’s <strong>top k</strong> guesses for an anonymized record.</p>

<p>In our case:</p>

<ul>
  <li>the solver produces <strong>one</strong> best match per anonymized record<br />
→ effectively Hit@1 only</li>
</ul>

<p>So we focus on <strong>Hit@1</strong>, the fraction of anonymized records where the attacker’s top guess is correct.</p>

<p>A Hit@1 of <strong>50%</strong> means the attacker correctly re-identified half of the individuals on the first guess.</p>

<p>(Hit@5 would allow up to 5 guesses per record, but we stick with the strictest measure.)</p>

<p>With the attack strategy and success metric defined, we now examine how re-identification risk and data utility change as anonymization strength increases.</p>

<h2 id="results">Results</h2>

<h2 id="re-identification-success-vs-anonymity-level">Re-identification Success vs. Anonymity Level</h2>

<p>We first examine how the attacker’s success rate (<strong>Hit@1</strong>) changes as the anonymity parameter <strong>k</strong> increases.<br />
Intuitively, higher k (stronger anonymity) should make re-identification harder.</p>

<p>Our experiments confirmed this:<br />
<strong>the attacker’s success drops dramatically as k grows.</strong></p>

<hr />
<p><img src="/assets/images/re_identification/heatmap_mean_hit_rate_rare_0.png" alt="Hit@1 heatmap by ZIP rarity and age bin" /></p>

<p><img src="/assets/images/re_identification/heatmap_mean_hit_rate_by_zip_rarity_age1.png" alt="A heatmap showing the attacker’s Hit@1 (darker means lower success) for various anonymity settings." /></p>

<p>Each cell is the average Hit@1 across trials for a given combination of k (y-axis) and age bin width (x-axis). Success rates plummet as k increases. Notably, there is a sharp drop in attacker success once k is around 5–7, indicating the onset of strong anonymity where the attack loses traction.</p>

<hr />

<p>Even at <strong>k = 1</strong> (minimal anonymization), the attacker does <strong>not</strong> get a 100% hit rate.<br />
The maximum Hit@1 observed hovered just above <strong>50%</strong>.</p>

<p>This is because even in the <strong>raw data</strong>, some individuals share the same QI values<br />
(e.g., multiple people with the same age, ZIP3, and sex),<br />
so they cannot all be uniquely identified by QIs alone.<br />
This sets an <strong>upper ceiling</strong> on re-identification success.</p>

<p>As k increases from <strong>1 to 5</strong>, Hit@1 falls gradually.<br />
Beyond <strong>k ~ 5–7</strong>, it <strong>plummets sharply</strong>.</p>

<p>By <strong>k ≥ 10</strong>, the attacker’s top-guess accuracy is very low<br />
(approaching random chance in many settings).</p>

<p><strong>Summary:</strong> raising k dramatically improves privacy, especially after the mid-range threshold where anonymity “kicks in.”</p>

<hr />

<h2 id="data-utility-loss-as-k-increases">Data Utility Loss as k Increases</h2>

<p>Stronger anonymization comes at the cost of <strong>data utility</strong>.</p>

<p>We tracked several metrics to quantify how the dataset’s analytical value degrades as k increases:</p>

<ul>
  <li>
    <p><strong>ZIP Utility:</strong><br />
Measures how well the distribution of ZIP3 values is preserved.<br />
Defined between 0–1, where 1.0 means the anonymized ZIP distribution exactly matches the original.<br />
(Computed as 1 − ½ × the L1 distance between the anonymized and original ZIP distributions.)</p>
  </li>
  <li>
    <p><strong>Mean Age Drift:</strong><br />
The difference in the average age between anonymized and original data.<br />
Captures how much anonymization distorts age information.</p>
  </li>
  <li>
    <p><strong>Mean Glucose:</strong><br />
A sanity check for a <strong>non-QI variable</strong> that should remain unchanged.</p>
  </li>
</ul>
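<p>Both distribution-level metrics are simple to compute. A minimal sketch follows; the choice of representative value for an anonymized age (here, the top-code threshold itself) is an assumption for illustration.</p>

```python
from collections import Counter

def zip_utility(orig_zips, anon_zips):
    """1 - (1/2) * L1 distance between the two ZIP3 distributions (1.0 = identical)."""
    p, q = Counter(orig_zips), Counter(anon_zips)
    n_p, n_q = len(orig_zips), len(anon_zips)
    l1 = sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))
    return 1 - 0.5 * l1

def mean_age_drift(orig_ages, anon_ages):
    """Anonymized mean age minus original mean age (negative = skewed younger)."""
    return sum(anon_ages) / len(anon_ages) - sum(orig_ages) / len(orig_ages)

orig = ["101", "101", "102", "103"]
anon = ["101", "101", "Other", "Other"]  # two rare ZIPs collapsed
print(zip_utility(orig, orig))  # 1.0 (distribution fully preserved)
print(zip_utility(orig, anon))  # 0.5 (half the mass moved to "Other")
print(mean_age_drift([30, 40, 80, 90], [30, 40, 75, 75]))  # -5.0: top-coding pulls the mean down
```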

<hr />

<p>Two of these metrics, <strong>ZIP Utility</strong> and <strong>Age Drift</strong>, clearly illustrate the non-linear loss of detail as k grows.</p>

<hr />

<p><img src="/assets/images/re_identification/zip_utility_vs_k.png" alt="ZIP Utility (y-axis) versus anonymity level k (x-axis)." /></p>

<p>The line shows that as k increases, the ZIP code distribution retains less and less of its original detail.<br />
Once k exceeds about 8, we see a notable drop in ZIP Utility.<br />
At k = 16, roughly 25–30% of the geographic granularity is lost.</p>

<hr />

<p><img src="/assets/images/re_identification/age_drift_vs_k.png" alt="Mean Age Drift (in years) as a function of k." /></p>

<p>A negative drift means the anonymized data’s average age is lower than the original.<br />
At high anonymity levels (k ≈ 20), the mean age is about 3 years lower.<br />
Top-coding and heavy binning compress the age distribution toward the middle.</p>

<hr />

<p>Reassuringly, <strong>Mean Glucose</strong> remained essentially unchanged across all k values (drift ~0).<br />
This confirms that the anonymization procedure targeted only QIs (age, zip, sex) and did not distort unrelated attributes.</p>

<p>Overall:</p>

<ul>
  <li>For small increases in <strong>k (1–5)</strong>, utility remains close to original fidelity.</li>
  <li>Beyond <strong>k ≈ 5–8</strong>, generalization becomes aggressive and utility drops sharply.</li>
</ul>

<p>This suggests a <strong>“sweet spot”</strong> where privacy improves significantly while preserving substantial utility, after which additional privacy becomes expensive in terms of information loss.</p>

<h2 id="the-privacyutility-frontier">The Privacy–Utility Frontier</h2>

<p>It is helpful to visualize the inherent trade-off between privacy and utility.</p>

<p>Each anonymization configuration we tested<br />
(a specific combination of <strong>k</strong> and generalization parameters)<br />
can be thought of as a single point in a two-dimensional space:</p>

<ul>
  <li>one axis = <strong>privacy outcome</strong> (e.g., Hit@1 re-identification success)</li>
  <li>the other axis = <strong>utility outcome</strong> (e.g., how many candidate matches remain / how much detail is preserved)</li>
</ul>

<p>Plotting all configurations reveals a clear <strong>privacy–utility frontier</strong>.</p>

<hr />

<p>Each point represents one anonymization scenario (specific k and parameter settings), plotted by its resulting privacy risk (y-axis: Hit@1 success rate) and a utility indicator (x-axis: number of plausible candidate matches per anonymized record, which correlates with retained information).</p>

<p><img src="/assets/images/re_identification/privacy_utility_frontier_age_compare.png" alt="Privacy–utility frontier (age)" /></p>

<p><img src="/assets/images/re_identification/privacy_utility_frontier_zip_compare.png" alt="Privacy–utility frontier (ZIP)" /></p>

<p>The plot forms a <strong>downward-sloping curve</strong>.</p>

<ul>
  <li>Configurations with <strong>lower re-identification risk</strong> invariably have <strong>lower data utility</strong>.</li>
  <li>The initial part of the curve is <strong>steep</strong> — meaning you can reduce risk significantly with only a small drop in utility.</li>
  <li>The later part of the curve <strong>flattens</strong> — meaning achieving tiny extra privacy gains requires <strong>large utility sacrifices</strong>.</li>
</ul>

<p>In simpler terms:</p>

<blockquote>
  <p>You can’t have it all.<br />
Past a certain point, making the data “very anonymous” makes it statistically or analytically blurry.</p>
</blockquote>

<p>The scatter shows every dataset version lies somewhere on this curve.<br />
Deciding where to operate is a <strong>policy choice</strong>:</p>

<ul>
  <li><strong>low k</strong> → high utility, low privacy</li>
  <li><strong>high k</strong> → high privacy, low utility</li>
</ul>

<h2 id="conclusion-and-discussion">Conclusion and Discussion</h2>

<p>Our empirical exploration highlights how increasing <strong>k-anonymity</strong> leads to <strong>diminishing returns</strong>.</p>

<p>For <strong>modest anonymity levels</strong> (up to around k = 5):</p>

<ul>
  <li>Each increment in k yields a <strong>big drop</strong> in re-identification risk.</li>
  <li>The corresponding hit to data utility is <strong>mild</strong>.</li>
</ul>

<p>Beyond that, however, the trade-off worsens:</p>

<ul>
  <li>Pushing k higher gives <strong>smaller and smaller privacy benefits</strong>.</li>
  <li>Meanwhile, it <strong>rapidly erodes</strong> the granularity and usefulness of the data.</li>
</ul>

<p>This is essentially a manifestation of a <strong>Pareto frontier</strong> —<br />
there comes a point where you must give up <strong>a lot</strong> of utility to get <strong>a little</strong> more privacy.</p>

<hr />

<h3 id="attribute-sensitivity">Attribute Sensitivity</h3>

<p>Different data attributes showed <strong>different sensitivity</strong> to anonymization:</p>

<ul>
  <li>
    <p><strong>Geographic detail (ZIP3)</strong> degraded <em>first</em>.<br />
Many ZIPs are rare → must be collapsed to <code class="language-plaintext highlighter-rouge">"Other"</code> as k grows.</p>
  </li>
  <li>
    <p><strong>Age</strong> was more resilient but eventually smoothed by<br />
<strong>wide bins</strong> and <strong>top-coding</strong> → resulting in shifts such as<br />
a <strong>3-year drop</strong> in average age at high k.</p>
  </li>
  <li>
    <p><strong>lab_glucose</strong> remained unchanged.<br />
Since glucose was <em>not</em> part of the QIs, anonymization preserved it.<br />
This demonstrates that non-identifying variables can remain intact<br />
even as identifying information is stripped away.</p>
  </li>
</ul>

<p>This attribute-by-attribute difference shows that <strong>utility loss is domain-specific</strong>.<br />
Some features lose meaning faster than others under anonymization.</p>

<hr />

<h3 id="about-the-attacker-model">About the Attacker Model</h3>

<p>It is also worth noting that our attack model was relatively <strong>basic</strong>.</p>

<p>We assumed the attacker only knows:</p>

<ul>
  <li>age</li>
  <li>sex</li>
  <li>region (ZIP3)</li>
</ul>

<p>And they use a <strong>straightforward optimal matching algorithm</strong>.</p>

<p>A more determined adversary might:</p>

<ul>
  <li>have <strong>additional clues</strong> (e.g., approximate health measurements)</li>
  <li>access <strong>multiple leaks</strong></li>
  <li>use <strong>statistical models</strong> to narrow matches</li>
  <li>use Bayesian linkage, ML-based scoring, or constraint solvers</li>
</ul>

<p>Such an attacker could defeat k-anonymity more often.<br />
Therefore the Hit@1 rates in our experiment may be <strong>optimistic</strong>.<br />
Real-world re-identification risk could be <strong>higher</strong>.</p>

<p>This highlights that anonymization should <strong>not</strong> be:</p>

<blockquote>
  <p>a one-time, set-and-forget protection mechanism.</p>
</blockquote>

<p>You must consider <strong>evolving threat models</strong><br />
and possibly combine k-anonymity with other techniques:</p>

<ul>
  <li>noise addition</li>
  <li>perturbation</li>
  <li>differential privacy</li>
  <li>synthetic data generation</li>
  <li>secure linkage systems</li>
</ul>

<hr />

<h3 id="finding-the-balance">Finding the Balance</h3>

<p>Ultimately, choosing an anonymization level is about <strong>balancing privacy risk against data usability</strong>.</p>

<p>Our experiment puts <strong>concrete numbers</strong> on that balance:</p>

<ul>
  <li>The initial drop in re-id risk (as k rises from 1→5) is <strong>encouraging</strong>.</li>
  <li>It means we can significantly protect identities <strong>without</strong> immediately destroying utility.</li>
  <li>But the flattening of the curve at higher k reminds us that<br />
<strong>aggressive anonymization</strong> yields minimal extra privacy at <strong>huge cost</strong>.</li>
</ul>

<p>Decision-makers should consider what level of risk is acceptable<br />
given the <strong>purpose</strong> of the data.</p>

<p>For many cases:</p>

<ul>
  <li>
    <p><strong>Moderate k</strong> (enough to prevent easy pinpointing of individuals)<br />
is <strong>sufficient</strong> and maintains usefulness.</p>
  </li>
  <li>
    <p><strong>High k</strong> may make the dataset <strong>practically unusable</strong>.</p>
  </li>
</ul>

<hr />

<h2 id="key-takeaways">Key Takeaways</h2>

<ul>
  <li>
    <p><strong>k-anonymity trades precision for privacy</strong>.<br />
Generalization and suppression remove detail from QIs.</p>
  </li>
  <li>
    <p><strong>Privacy gains are strong at first, then plateau</strong>.<br />
Beyond mid-range k, utility collapses faster than privacy improves.</p>
  </li>
  <li>
    <p><strong>Utility loss is nonlinear and varies by attribute</strong>.<br />
Sparse attributes like ZIP lose meaning earlier.</p>
  </li>
  <li>
    <p><strong>Non-identifying attributes can remain intact</strong>.<br />
Good for preserving analytical value.</p>
  </li>
  <li>
    <p><strong>Past moderate k, returns diminish greatly</strong>.<br />
More anonymity → minimal privacy gain, major utility loss.</p>
  </li>
</ul>

<p>In conclusion:<br />
Effective anonymization is about finding the <strong>balance</strong>:<br />
protecting individuals without rendering data barren for analysis.<br />
Our findings illustrate that balance clearly for this dataset under different settings.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="anonymization" /><category term="privacy" /><category term="optimization" /><summary type="html"><![CDATA[2026-02-08 — Exploring how increasing k-anonymity affects data utility and the attacker’s ability to re-identify records.]]></summary></entry><entry><title type="html">Using geometry to choose embeddings</title><link href="https://www.testingbranch.com/embedding-quality/" rel="alternate" type="text/html" title="Using geometry to choose embeddings" /><published>2025-11-11T00:00:00+00:00</published><updated>2025-11-11T00:00:00+00:00</updated><id>https://www.testingbranch.com/embedding-quality</id><content type="html" xml:base="https://www.testingbranch.com/embedding-quality/"><![CDATA[<p>Code: <a href="https://github.com/mpcsb/tb-embedding-quality">github.com/mpcsb/tb-embedding-quality</a></p>

<h2 id="why-this-matters">Why this matters</h2>

<p>We tend to treat <strong>cosine distance</strong> as if it were a true metric.<br />
Cosine distance does <strong>not</strong> satisfy the triangle inequality: this is proven explicitly <strong><a href="https://arxiv.org/abs/2107.04071">here</a></strong>.</p>

<p>Metric indexes <em>do</em> rely on the triangle inequality to prune the search space, as shown in<br />
<strong><a href="https://homes.cs.aau.dk/~csj/Papers/Files/2015_ChenICDE.pdf">Efficient Metric Indexing for Similarity Search</a></strong>.</p>

<blockquote>
  <p>Metric indexing relies on triangle inequality for pruning.</p>
</blockquote>

<p>HNSW / FAISS don’t require strict metric axioms, but they <strong>assume neighborhood consistency</strong>:
if a point is “closer,” greedy search expects it to lead to even closer points.</p>

<p>That assumption only holds when the embedding space has stable geometry (i.e. distances behave consistently like in a true metric).</p>

<h3 id="what-causes-the-embedding-geometry-to-break-down">What causes the embedding geometry to break down</h3>

<p>Two independent things:</p>

<ol>
  <li>
    <p><strong>Bad corpus / domain mismatch</strong><br />
If the embedding model wasn’t trained on similar text, semantics get scattered.</p>
  </li>
  <li>
    <p><strong>Compression (PCA + quantization)</strong><br />
Removes structure. Local neighborhoods collapse. This is particularly bad because compression solves a lot of operational problems.</p>
  </li>
</ol>

<p>Both lead to the same consequences:</p>

<ul>
  <li>nearest neighbors stop being nearest</li>
  <li>triangle inequality fails locally, and fails harder as distances increase.</li>
  <li>retrieval (the R in RAG) becomes unstable</li>
</ul>

<p>This post measures that directly.</p>

<h2 id="setup">Setup</h2>

<p>We embed two different datasets:</p>

<table>
  <thead>
    <tr>
      <th>Corpus</th>
      <th>What it contains</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong><a href="https://www.kaggle.com/datasets/vinitshah0110/food-composition">food</a></strong></td>
      <td>short ingredient / composition snippets</td>
      <td>noisy, repetitive text</td>
    </tr>
    <tr>
      <td><strong><a href="https://www.kaggle.com/datasets/matthewjansen/pubmed-200k-rtc">medical</a></strong></td>
      <td>clinical trial abstracts</td>
      <td>dense, clean text</td>
    </tr>
  </tbody>
</table>

<p>Three embedding variants:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Model</th>
      <th>Dim</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">A_raw</code></td>
      <td>DistilBERT STS-B</td>
      <td>768</td>
      <td>strong baseline, ‘high’ dimension</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">B_raw</code></td>
      <td>MiniLM-L6-v2</td>
      <td>384</td>
      <td>common model in local/demo RAG systems</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">B_pca64q4</code></td>
      <td>MiniLM → PCA 64 → 4-bit</td>
      <td>64</td>
      <td>aggressive compression</td>
    </tr>
  </tbody>
</table>

<p>Each corpus was chunked into 1000 samples.</p>
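<p>The exact compression pipeline for <code class="language-plaintext highlighter-rouge">B_pca64q4</code> isn’t shown in the post; a minimal sketch, assuming a plain SVD-based PCA followed by uniform per-dimension 4-bit quantization on random stand-in data, might look like:</p>

```python
import numpy as np

def pca_project(X, dim):
    """Project onto the top `dim` principal components (no external deps)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    return Xc @ Vt[:dim].T

def quantize_4bit(X):
    """Uniform per-dimension 4-bit quantization: 16 levels between min and max."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    scale = (hi - lo) / 15
    codes = np.round((X - lo) / scale)  # integer codes in 0..15
    return codes * scale + lo           # dequantized values used for distance computation

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))       # stand-in for MiniLM-L6-v2 embeddings
Z = quantize_4bit(pca_project(X, 64))  # the "B_pca64q4" variant
print(Z.shape)  # (1000, 64)
```

<p>Each output dimension can take at most 16 distinct values, which is exactly the kind of coarsening that collapses local neighborhoods.</p>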

<hr />

<h2 id="what-we-measure">What we measure</h2>

<p>For each point <em>i</em>:</p>

<ol>
  <li>find its top-k nearest neighbors (<code class="language-plaintext highlighter-rouge">j</code>)</li>
<li>check whether any neighbor of a neighbor (<code class="language-plaintext highlighter-rouge">k</code>) breaks the triangle inequality:<br />
d(i, k) &gt; d(i, j) + d(j, k) + τ</li>
</ol>

<p>If no violation exists, the point is <strong>clean</strong>.</p>

<p>The metric:</p>
<blockquote>
  <p>clean_frac = fraction of points with consistent neighborhoods</p>
</blockquote>

<p>τ = tolerance.<br />
Higher <code class="language-plaintext highlighter-rouge">clean_frac</code>: stable space.<br />
Lower <code class="language-plaintext highlighter-rouge">clean_frac</code>: distances are not reliable.</p>

<p>I use <strong>Z3</strong> only to answer a yes/no question:</p>

<blockquote>
  <p>“Does <em>any</em> violating (j, k) exist for this anchor?”</p>
</blockquote>

<p>Brute-forcing all <code class="language-plaintext highlighter-rouge">(i, j, k)</code> triples did not work well when I tackled this in the past.<br />
Z3 doesn’t brute force. It treats distances as constraints and either:</p>
<ul>
  <li>finds a violating triplet, or</li>
  <li>proves that none exists for that anchor.</li>
</ul>
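<p>For small neighbor counts, the same yes/no question can be answered by a plain scan. The sketch below is a brute-force conceptual equivalent of the per-anchor check, not the post’s Z3 implementation; on a genuine metric (Euclidean) every anchor comes out clean.</p>

```python
import numpy as np

def clean_frac(D, k=5, tau=0.01):
    """Fraction of anchors i with no triangle violation in their top-k neighborhood.

    A violation is d(i, k) > d(i, j) + d(j, k) + tau for j in NN(i), k in NN(j).
    """
    n = D.shape[0]
    nn = np.argsort(D, axis=1)[:, 1:k + 1]  # top-k neighbors, self excluded
    clean = 0
    for i in range(n):
        violated = any(
            D[i, kk] > D[i, j] + D[j, kk] + tau
            for j in nn[i] for kk in nn[j] if kk != i
        )
        clean += not violated
    return clean / n

# Euclidean distances always satisfy the triangle inequality -> clean_frac = 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(clean_frac(D))  # 1.0
```

<p>Swapping in a non-metric distance (or a heavily compressed embedding) is what drives this fraction down in the heatmaps below.</p>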

<h2 id="results">Results</h2>
<p>Three parts:</p>

<h3 id="1-umap-do-embeddings-even-cluster-coherently">1) UMAP: do embeddings even cluster coherently?</h3>

<p><img src="/assets/images/embedding_quality/umap_embeddings.png" alt="umap_embeddings" /></p>

<p>Raw embedding models produce tight clusters separated by corpus, whereas PCA+quantization blurs everything together, causing the food and medical corpora to overlap with no separation!</p>

<p>Already looks like geometry degradation.</p>

<hr />

<h3 id="2-heatmap-metric-stability-across-k-neighbors-and-τ-tolerance">2) Heatmap: metric stability across <code class="language-plaintext highlighter-rouge">k</code> neighbors and τ tolerance</h3>
<p>(In the heatmap, the horizontal axis is the neighbor count k and the vertical axis is the tolerance τ. Color indicates the fraction of points that remain clean)</p>

<p><img src="/assets/images/embedding_quality/heatmap.png" alt="heatmap" /></p>

<p>Observations:</p>

<ul>
  <li>medical corpus: <strong>solid geometry</strong> (clean_frac ≈ 1.0)</li>
  <li>food corpus: noisy semantics, which leads to poor geometry</li>
  <li>PCA+quantized: catastrophic collapse</li>
</ul>

<p>Even at τ = 0.1 (a huge forgiveness margin), PCA+quantized still breaks.</p>

<hr />

<h3 id="3-stability-vs-k-how-fast-the-neighborhood-falls-apart">3) Stability vs k: how fast the neighborhood falls apart</h3>

<p><img src="/assets/images/embedding_quality/stability_curves.png" alt="stability_curves" /></p>

<ul>
  <li>raw embeddings degrade slowly as k expands</li>
  <li>compressed embedding collapses by k=10</li>
</ul>

<p>If retrieval expands k during rerank / recall-then-rerank — expect garbage neighbors.</p>

<hr />

<h2 id="key-takeaways">Key takeaways</h2>

<ol>
  <li>
    <p><strong>Embeddings are not guaranteed to form a metric space.</strong><br />
If triangle inequality fails, nearest neighbors may not be the nearest.<br />
Retrieval results may not be ideal.</p>
  </li>
  <li>
    <p><strong>Compression destroys neighborhood structure.</strong><br />
PCA+quantization doesn’t merely ‘reduce redundancy’; it discards structure. This step needs extra monitoring, as results can degrade <strong>fast</strong>.</p>
  </li>
  <li>
<p><strong>A weaker or less structured corpus yields garbage geometry.</strong><br />
Not surprising.</p>
  </li>
</ol>

<blockquote>
  <p>Choose embeddings based on how well they preserve geometry.</p>
</blockquote>

<hr />

<p>Vector DBs assume a metric space.<br />
Embedding models don’t always give you one.</p>

<p>If the embedding space breaks (wrong model, wrong corpus, or compression) nearest neighbors aren’t nearest and the R in RAG stands for roulette…</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="embeddings" /><category term="rag" /><category term="machine-learning" /><category term="z3" /><summary type="html"><![CDATA[2025-11-11 — Empirical evaluation of local geometry in vector embeddings across models and corpora.]]></summary></entry><entry><title type="html">Model Equivalence using Z3</title><link href="https://www.testingbranch.com/Z3-and-model-equivalence/" rel="alternate" type="text/html" title="Model Equivalence using Z3" /><published>2025-11-07T00:00:00+00:00</published><updated>2025-11-07T00:00:00+00:00</updated><id>https://www.testingbranch.com/Z3-and-model-equivalence</id><content type="html" xml:base="https://www.testingbranch.com/Z3-and-model-equivalence/"><![CDATA[<p>Code: <a href="https://github.com/mpcsb/tb_model_equivalence">github.com/mpcsb/tb_model_equivalence</a></p>

<hr />

<p>Most model replacement flows stop after <strong>validation accuracy</strong>.</p>

<p>If loss and accuracy remain roughly the same, the task is considered done.<br />
But validation only tells us that <strong>on the samples we checked</strong> the models behave similarly.</p>

<p>It says <strong>nothing about the rest of the input space.</strong></p>
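<p>A toy illustration of that gap (both models here are hypothetical stand-ins): two classifiers can agree on every validation sample we happen to draw and still disagree on a sliver of the input space.</p>

```python
import numpy as np

def model_a(x):
    return (x > 0.5).astype(int)

def model_b(x):
    # identical to model_a except on a tiny sliver of the input space
    sliver = (x > 0.7) & (x < 0.70001)
    return ((x > 0.5) & ~sliver).astype(int)

rng = np.random.default_rng(0)
val = rng.uniform(0.0, 1.0, 1000)                 # "validation set"
agreement = (model_a(val) == model_b(val)).mean() # almost surely perfect
x_bad = np.array([0.700005])                      # yet equivalence fails here
```

Validation accuracy is blind to <code class="language-plaintext highlighter-rouge">x_bad</code> unless we happen to sample it; an exhaustive method is needed to find it reliably.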

<h2 id="why-this-matters--two-major-use-cases">Why this matters — two major use cases</h2>

<p>There are at least two distinct workflows where this matters:</p>

<h3 id="a-model-pruning--distillation--simplification">A) Model pruning / distillation / simplification</h3>
<p>We modify a model intentionally:</p>
<ul>
  <li>reduce latency</li>
  <li>reduce model size</li>
  <li>simplify the architecture (for interpretability or cost)</li>
</ul>

<p>We want to know if the simplified model really behaves like the original one.</p>

<blockquote>
  <p>Example: Random Forest → Pruned Random Forest (our example in this post)</p>
</blockquote>

<h3 id="b-model-retraining--continuous-integration">B) Model retraining / continuous integration</h3>
<p>A model is re-trained with new data, new hyperparams, or a new architecture.</p>

<p>Before replacing the model in production we need to know:</p>

<p><strong>Is the new model equivalent to the legacy one? If not, how much and where do they differ?</strong></p>

<p>This turns model replacement into a <strong>regression test</strong>, similar to CI/CD.
This makes model updates less opaque and gives us the scope we need to understand how the new model generalizes.</p>

<hr />

<p>Validation tells us <strong>similarity on sampled points</strong>, on information that we already have.<br />
What we want is <strong>equivalence</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For all inputs x in the domain:
    model_A(x) == model_B(x)
</code></pre></div></div>

<p>If the answer is no, we want to know the exact violating input <em>x</em>.</p>

<h2 id="z3-for-model-equivalence-proving-ml-models-match-or-finding-the-exact-input-where-they-dont">Z3 for Model Equivalence: Proving ML Models Match (or Finding the exact input where they don’t)</h2>

<p><strong>Goal</strong>: Instead of <em>measuring</em> similarity between models, <strong>prove</strong> they’re equivalent, and where they’re not: extract the exact counterexample.</p>

<p><a href="https://en.wikipedia.org/wiki/Z3_Theorem_Prover">Z3</a> is a constraint solver from Microsoft Research.<br />
Optimizers try values and adjust based on results.<br />
Z3 doesn’t search and it doesn’t <em>brute-force</em> the computation; it <strong>reasons</strong> about all possible inputs.<br />
We state the rules, and it determines whether any input satisfies them.</p>
<blockquote>
  <p>For model equivalence, we ask: is there any x where the two models disagree? If yes, Z3 returns that x; if not, it proves none exists.</p>
</blockquote>

<p>The classic use case (or at least how I first heard of Z3): <strong><a href="https://ericpony.github.io/z3py-tutorial/guide-examples.htm#sudoku">Sudoku solver</a></strong>.</p>

<p>Every rule is encoded as a constraint: numbers 1–9, unique rows, unique cols, etc.<br />
Z3 doesn’t brute-force — it symbolically prunes entire spaces at once.</p>

<p>We use the same trick here.</p>

<p>Instead of trying inputs until the model fails, we ask Z3:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>∃ x   such that   Model_A(x) ≠ Model_B(x)
</code></pre></div></div>

<p>If yes → Z3 produces that violating x.<br />
If not → Z3 proves no such x exists within the domain.</p>

<hr />

<h2 id="encoding-a-model-as-logic">Encoding a model as logic</h2>

<p>Decision trees are perfect for SMT solving because they’re <strong>pure conditional logic</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if x[5] &lt;= 0.12:  
   left
else:             
   right
return leaf_label
</code></pre></div></div>

<p>Z3 can encode every branch of every tree.</p>

<p>Then we assert:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(pred_A(x) != pred_B(x))
</code></pre></div></div>

<p>and let Z3 do the search.</p>

<hr />

<h2 id="code-locating-the-exact-counterexample">Code: Locating the exact counterexample</h2>

<p>If a single violating input exists, Z3 returns it: guaranteed and verifiable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">z3</span> <span class="kn">import</span> <span class="n">Real</span><span class="p">,</span> <span class="n">RealVal</span><span class="p">,</span> <span class="n">If</span><span class="p">,</span> <span class="n">And</span><span class="p">,</span> <span class="n">Or</span><span class="p">,</span> <span class="n">Sum</span><span class="p">,</span> <span class="n">Solver</span><span class="p">,</span> <span class="n">sat</span>

<span class="k">def</span> <span class="nf">encode_tree_as_z3</span><span class="p">(</span><span class="n">tree</span><span class="p">,</span> <span class="n">x_vars</span><span class="p">):</span>
    <span class="n">t</span> <span class="o">=</span> <span class="n">tree</span><span class="p">.</span><span class="n">tree_</span>
    <span class="n">L</span><span class="p">,</span> <span class="n">R</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">children_left</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">children_right</span>
    <span class="n">feat</span><span class="p">,</span> <span class="n">thr</span><span class="p">,</span> <span class="n">val</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">feature</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">threshold</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">value</span>

    <span class="k">def</span> <span class="nf">go</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">==</span> <span class="n">R</span><span class="p">[</span><span class="n">n</span><span class="p">]:</span>
            <span class="k">return</span> <span class="n">RealVal</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">val</span><span class="p">[</span><span class="n">n</span><span class="p">][</span><span class="mi">0</span><span class="p">].</span><span class="n">argmax</span><span class="p">()))</span>
        <span class="k">return</span> <span class="n">If</span><span class="p">(</span><span class="n">x_vars</span><span class="p">[</span><span class="n">feat</span><span class="p">[</span><span class="n">n</span><span class="p">]]</span> <span class="o">&lt;=</span> <span class="n">thr</span><span class="p">[</span><span class="n">n</span><span class="p">],</span>
                  <span class="n">go</span><span class="p">(</span><span class="n">L</span><span class="p">[</span><span class="n">n</span><span class="p">]),</span>
                  <span class="n">go</span><span class="p">(</span><span class="n">R</span><span class="p">[</span><span class="n">n</span><span class="p">]))</span>
    <span class="k">return</span> <span class="n">go</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">encode_forest_avg_vote</span><span class="p">(</span><span class="n">rf</span><span class="p">,</span> <span class="n">x_vars</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">Sum</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">encode_tree_as_z3</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">x_vars</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">rf</span><span class="p">.</span><span class="n">estimators_</span><span class="p">])</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">rf</span><span class="p">.</span><span class="n">estimators_</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">z3_label_counterexample</span><span class="p">(</span><span class="n">big</span><span class="p">,</span> <span class="n">pruned</span><span class="p">,</span> <span class="n">lo</span><span class="p">,</span> <span class="n">hi</span><span class="p">):</span>
    <span class="n">d</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">lo</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="p">[</span><span class="n">Real</span><span class="p">(</span><span class="sa">f</span><span class="s">"x</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">d</span><span class="p">)]</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">encode_forest_avg_vote</span><span class="p">(</span><span class="n">big</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">encode_forest_avg_vote</span><span class="p">(</span><span class="n">pruned</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>

    <span class="n">s</span> <span class="o">=</span> <span class="n">Solver</span><span class="p">()</span>
    <span class="n">s</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">And</span><span class="p">(</span><span class="o">*</span><span class="p">[(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="n">lo</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">hi</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">d</span><span class="p">)]))</span>
    <span class="n">s</span><span class="p">.</span><span class="n">add</span><span class="p">((</span><span class="n">b</span> <span class="o">&gt;=</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">!=</span> <span class="p">(</span><span class="n">p</span> <span class="o">&gt;=</span> <span class="mf">0.5</span><span class="p">))</span>

    <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="n">check</span><span class="p">()</span> <span class="o">!=</span> <span class="n">sat</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">model</span><span class="p">()</span>
    <span class="k">return</span> <span class="p">[</span><span class="nb">float</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="n">xi</span><span class="p">]))</span> <span class="k">for</span> <span class="n">xi</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">lo</code> and <code class="language-plaintext highlighter-rouge">hi</code> are per-feature min/max bounds Z3 must respect. This prevents it from returning absurd values like x = 10⁹.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt; [-0.965, 0.549, 0.247, 0.589, 0.475, 3.397, ...]
</code></pre></div></div>
<p>This vector is a real point in feature space that breaks model equivalence.</p>

<h2 id="minimal-explanation-trace">Minimal explanation trace</h2>

<p>We can even trace which removed trees caused the discrepancy:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">trace_disagreement</span><span class="p">(</span><span class="n">x_cex</span><span class="p">,</span> <span class="n">big</span><span class="p">,</span> <span class="n">pruned</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">8</span><span class="p">):</span>
    <span class="n">xb</span> <span class="o">=</span> <span class="n">x_cex</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">votes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">t</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">xb</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">big</span><span class="p">.</span><span class="n">estimators_</span><span class="p">],</span> <span class="nb">float</span><span class="p">)</span>
    <span class="n">removed</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">big</span><span class="p">.</span><span class="n">estimators_</span><span class="p">)</span> <span class="k">if</span> <span class="n">t</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">pruned</span><span class="p">.</span><span class="n">estimators_</span><span class="p">]</span>

    <span class="n">diffs</span> <span class="o">=</span> <span class="p">[(</span><span class="nb">abs</span><span class="p">(</span><span class="n">votes</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">-</span> <span class="p">((</span><span class="n">votes</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">-</span> <span class="n">votes</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">/</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">votes</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">))),</span> <span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">removed</span><span class="p">]</span>
    <span class="n">diffs</span><span class="p">.</span><span class="n">sort</span><span class="p">(</span><span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">diffs</span><span class="p">[:</span><span class="n">top_k</span><span class="p">])</span>
</code></pre></div></div>

<p>This produces a ranked list of the trees that mattered with respect to the divergence of the two models.</p>

<h2 id="visualizing-the-disagreement-surface">Visualizing the disagreement surface</h2>

<table>
  <thead>
    <tr>
      <th>Interpretation</th>
      <th>Image</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Where vote probability differs the most</td>
      <td><img src="/assets/images/model_equivalence/1.png" alt="Heatmap showing where vote probability diverges the most between big and pruned model" /></td>
    </tr>
    <tr>
      <td>Where the predicted labels actually differ</td>
      <td><img src="/assets/images/model_equivalence/2.png" alt="Binary map showing where the two models disagree in predicted label across the same 2D feature slice" /></td>
    </tr>
    <tr>
      <td>Zoom on violation region with arbitrary difference</td>
      <td><img src="/assets/images/model_equivalence/3.png" alt="Zoomed-in view of disagreement region filtered to only high-confidence conflicting predictions" /></td>
    </tr>
  </tbody>
</table>

<p>These visuals are easy to track, and with some work, generating them for the most problematic feature combinations could be very revealing for the pruning process.</p>

<p>Most of the input space is essentially the same, but we see precise “fault lines” where pruning changes the target predictions.</p>

<h2 id="why-this-matters">Why this matters</h2>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Guarantees?</th>
      <th>Finds exact failure case?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Validation set</td>
      <td>No</td>
      <td>No</td>
    </tr>
    <tr>
      <td><strong>Z3</strong></td>
      <td>Yes</td>
      <td>Yes, if one exists</td>
    </tr>
  </tbody>
</table>

<p>Validation gives confidence; Z3 gives <strong>certainty</strong>.</p>

<h2 id="closing-remarks">Closing Remarks</h2>

<ul>
  <li>We can formally prove two models behave identically.</li>
  <li>Use it to validate pruning / distillation work.</li>
  <li>Use it to guard model retraining in CI/CD.</li>
  <li>If models diverge, Z3 gives the input that caused the divergence.</li>
</ul>

<p>No brute force, no test-dataset guesses.</p>

<p>Just <strong>mathematically guaranteed model equivalence (or a counterexample).</strong></p>

<h2 id="further-reading--neural-network-equivalence-via-smt">Further reading — neural network equivalence via SMT</h2>

<p>The idea of proving that two models are equivalent (or extracting counterexamples when they aren’t) originates from formal verification research, in particular:</p>

<p>Eleftheriadis et al., <a href="https://www.ccs.neu.edu/~stavros/papers/2022-formats-NN_Equivalence.pdf">On Neural Network Equivalence Checking Using SMT Solvers</a>, FORMATS 2022.<br />
Their work focuses on neural networks and supports strict + approximate equivalence relations.</p>

<p>This post adapts a fairly similar encoding idea to decision-tree ensembles (random forests), making equivalence checking usable in practical ML pipelines.
Z3 effectively constructs the entire random forest as a single logical expression.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="z3" /><category term="optimization" /><category term="model-equivalence" /><category term="machine-learning" /><category term="operations" /><summary type="html"><![CDATA[2025-11-07 — Using Z3 to prove two ML models are logically equivalent — or extract the exact counterexample where they diverge.]]></summary></entry><entry><title type="html">Quantifying Information Loss</title><link href="https://www.testingbranch.com/information_loss_and_noise/" rel="alternate" type="text/html" title="Quantifying Information Loss" /><published>2025-10-28T00:00:00+00:00</published><updated>2025-10-28T00:00:00+00:00</updated><id>https://www.testingbranch.com/information_loss_and_noise</id><content type="html" xml:base="https://www.testingbranch.com/information_loss_and_noise/"><![CDATA[<p>(This post comes from a series of old notebook ideas I’m revisiting — notes written years ago, now turned into posts.)</p>

<h2 id="why-measure-information-loss-when-adding-noise">Why measure information loss when adding noise?</h2>

<p>A <a href="https://www.johndcook.com/blog/2019/11/25/stochastic-rounding-and-privacy/">post on Cook’s blog</a> showed how rounding numeric values can act as a simple form of privacy.</p>

<p>That idea caught my attention: rounding is just a deterministic way of adding noise.<br />
So how much information do we actually lose when we do this?</p>

<p>This note looks at answering that question.<br />
By adding Laplace noise (a common way to blur numeric data: small shifts most of the time, big ones only occasionally) of different magnitudes to a set of “ages” and measuring the mutual information with the original data, we can see how information degrades as noise grows and how that compares to ordinary binning.<br />
Each noise scale <em>b</em> has an equivalent bin width: the point where both destroy the same amount of information.</p>

<h2 id="setup">Setup</h2>

<p>We’ll start with a simple “age” variable drawn from a synthetic distribution over 0–100. More realistic distributions seemed to reach roughly the same conclusions.<br />
To each value, we add Laplace noise with different scales <em>b</em>, and measure how much mutual information remains between the noisy and original data.</p>

<p>For comparison, we also apply deterministic binning: rounding ages into 1-, 5-, and 10-year intervals.<br />
This acts as an upper bound on what the same magnitude of noise would erase.</p>

<p>The figure below maps the two: every noise scale <em>b</em> has an equivalent bin width where the information loss matches.</p>
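<p>The core measurement fits in a few lines. A simplified sketch with a crude histogram-based mutual information estimate in plain numpy — the linked code is the real experiment and may differ in estimator and binning:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y, bins=30):
    """Crude histogram estimate of I(X;Y) in bits."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

ages = rng.uniform(0, 100, 20_000)

# Laplace noise at increasing scales b
mi_noise = {b: mutual_information(ages, ages + rng.laplace(scale=b, size=ages.size))
            for b in (1.0, 5.0, 20.0)}

# deterministic binning at increasing widths (rounding to bin midpoints)
def binned_mi(x, width):
    return mutual_information(x, (np.floor(x / width) + 0.5) * width)

mi_bins = {w: binned_mi(ages, w) for w in (1, 5, 10)}
```

Intersecting the noise curve with the horizontal binning levels gives the equivalent bin width for each noise scale <em>b</em>.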

<h2 id="results">Results</h2>

<p>Information drops smoothly as the noise scale increases.<br />
Small <em>b</em> values barely affect it, but once the noise exceeds a few years, most detail is gone.</p>

<p>The horizontal lines show fixed widths for comparison.<br />
Each crosses the Laplace curve at the point where both destroy the same amount of information, which gives a practical way to read noise as “effective resolution”.</p>

<p><img src="/assets/images/information_loss/info_loss_vs_b_pretty.png" alt="Information loss vs noise scale" /></p>

<hr />

<p>Noise defines an implicit resolution, which is how precisely a value can still be inferred, and binning defines it explicitly.<br />
Both erase the same amount of information, but they’re effectively not the same operation.</p>

<p>When you bin, you restrict knowledge to a clear interval: “this person is between 25 and 30”.
When you add noise, you blur every point independently — sometimes within that window, sometimes beyond it.</p>

<p>Both limit what can be learned, but only noise introduces uncertainty.</p>

<p>Binning is <strong>limited by the units we already use</strong>: we can round ages to years or to 5-year groups, but cannot go finer than the base unit.<br />
Noise isn’t bound by that because it can be <em>arbitrarily small or large</em>, adjusting precision continuously rather than in discrete steps.</p>

<h2 id="final-remarks">Final remarks</h2>

<ul>
  <li>
    <p><strong>Noise and binning set resolution differently.</strong><br />
One continuous, one discrete — both shape how much detail survives.</p>
  </li>
  <li>
    <p><strong>Noise is tunable.</strong><br />
Its scale <em>b</em> acts as a continuous knob on effective precision, unlike fixed bins.</p>
  </li>
  <li>
    <p><strong>Information loss is measurable.</strong><br />
Mutual information quantifies how much structure the data retain after perturbation.</p>
  </li>
  <li>
    <p><strong>At large noise scales, precision saturates.</strong><br />
Beyond the data’s natural granularity, extra noise only adds randomness.</p>
  </li>
</ul>

<p><a href="https://www.testingbranch.com/src_noise_info_loss/">Check the code</a></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="noise" /><category term="information" /><category term="data-privacy" /><summary type="html"><![CDATA[2025-10-28 — A quick experiment linking Laplace noise and data resolution, showing how privacy and precision trade off]]></summary></entry><entry><title type="html">Model based simulations</title><link href="https://www.testingbranch.com/model_based_simulation/" rel="alternate" type="text/html" title="Model based simulations" /><published>2021-06-09T00:00:00+00:00</published><updated>2025-10-30T00:00:00+00:00</updated><id>https://www.testingbranch.com/model_based_simulation</id><content type="html" xml:base="https://www.testingbranch.com/model_based_simulation/"><![CDATA[<p>This note walks through a simple but realistic case where Bayesian logistic regression helps simulate pricing scenarios — a model based way to explore sales decisions.</p>

<h2 id="why-use-bayesian-regression-for-model-based-simulations">Why use Bayesian regression for model based simulations?</h2>

<p>In this post, we’ll build a simple probabilistic model and use it to simulate a few scenarios.</p>

<p>Linear models handle noisy observations well — they stay focused on the main signal instead of chasing small fluctuations. Bayesian regression adds key advantages: we can encode domain knowledge as priors, quantify uncertainty directly from the posterior, and express results as probabilities rather than p-values or arbitrary confidence intervals.</p>

<p>For exploring counterfactual or simulated scenarios, that mix of simplicity and principled uncertainty is exactly what we need.</p>

<h2 id="case-study">Case Study</h2>

<p>A good example for this kind of modeling is converting sales opportunities.</p>

<p>Sales reps typically log opportunities under their accounts, along with details such as the offered unit price and whether the deal was ultimately won or lost.<br />
Some of these attributes are naturally informative, and, as with most purchases, price is often the dominant factor behind the conversion.</p>

<p>Still, the data doesn’t capture everything. Competitor pricing, credit limits, or internal approval rules can all affect the outcome, and their absence adds noise to the target variable.</p>

<p>Because we understand this process and the role of price so well, it makes an ideal test case for a simple Bayesian model.</p>

<h2 id="data">Data</h2>

<p>We’ll start with a small simulated dataset of 500 sales opportunities.<br />
Each record has a conversion status (<code class="language-plaintext highlighter-rouge">won</code> = 1, <code class="language-plaintext highlighter-rouge">lost</code> = 0) that depends mainly on the <strong>unit price</strong> offered, along with two categorical attributes — the <strong>account country</strong> and the <strong>product ID</strong>.<br />
Each product and country has its own coefficient, representing factors such as sales-rep behavior, discounts, or product-specific promotions.
(A full hierarchical version could share information across groups through hyperpriors, but here each level is modeled independently for simplicity.)</p>

<p>Additional random variation is included to capture unobserved factors that influence conversion — for instance, market competitiveness or credit conditions.</p>

<p>A sample of the dataset is shown below.<br />
For simplicity, the model will ignore the <code class="language-plaintext highlighter-rouge">amount</code> column (informative but unnecessary here) and focus on <code class="language-plaintext highlighter-rouge">unit_price</code>, <code class="language-plaintext highlighter-rouge">country</code>, and <code class="language-plaintext highlighter-rouge">product ID</code>.</p>

<table>
  <thead>
    <tr>
      <th>id</th>
      <th>unit_price</th>
      <th>p_id</th>
      <th>amount</th>
      <th>country</th>
      <th>status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>325</td>
      <td>30.342365</td>
      <td>4</td>
      <td>62.734691</td>
      <td>a</td>
      <td>1</td>
    </tr>
    <tr>
      <td>457</td>
      <td>69.475791</td>
      <td>5</td>
      <td>20.922939</td>
      <td>c</td>
      <td>0</td>
    </tr>
    <tr>
      <td>351</td>
      <td>30.164137</td>
      <td>4</td>
      <td>73.612906</td>
      <td>b</td>
      <td>1</td>
    </tr>
    <tr>
      <td>224</td>
      <td>2.851734</td>
      <td>1</td>
      <td>207.205554</td>
      <td>c</td>
      <td>1</td>
    </tr>
    <tr>
      <td>123</td>
      <td>39.875412</td>
      <td>4</td>
      <td>7.842334</td>
      <td>b</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Check the data generating <a href="https://www.testingbranch.com/src_model_simulation/">code</a> for the specifics.</p>

<p>The plots below show how the conversion target varies across the five simulated products and three countries.<br />
The division follows the ratio of offered price to base price, though it’s not a strict boundary.<br />
Higher simulation noise makes that division fuzzier and the classification problem harder overall.</p>

<figure>
  
<img src="/assets/images/bayesian_simulation/country_product.png" alt="Foo" />
 
</figure>

<h2 id="model">Model</h2>

<p>Let’s set up a simple model to study opportunity conversion.<br />
This isn’t meant to perfectly fit the data — just a basic object for controlled simulations.</p>

<p>We’ll use <strong>PyMC3</strong> (now PyMC) to implement a Bayesian version of logistic regression.<br />
If any part of the definition feels unclear, check their <a href="https://www.pymc.io/">examples and docs</a>.</p>

<p>The unit price values are normalized by each product’s base price.<br />
This scaling keeps features near zero, which simplifies the choice of priors.</p>
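<p>A minimal version of that scaling step, assuming a hypothetical per-product base-price lookup:</p>

```python
# Hypothetical base prices; the real ones come from the simulation setup.
BASE_PRICE = {1: 5.0, 2: 10.0, 3: 20.0, 4: 35.0, 5: 60.0}

def normalize_offers(unit_prices, product_ids):
    """Divide each offer by its product's base price, so 1.0 means 'at base price'."""
    return [price / BASE_PRICE[p] for price, p in zip(unit_prices, product_ids)]
```

<p>A normalized value above 1.0 is a mark-up over the base price; below 1.0 is a discount.</p>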

<p>The model includes linear terms for <strong>product</strong> and <strong>country</strong>, along with a common intercept, and uses a <strong>logit</strong> link to map the linear predictor to probabilities.<br />
These probabilities then define a <strong>Bernoulli</strong> likelihood for the conversion outcome.</p>

<p>Because we know price has the strongest influence, we’ll assign a prior that allows its coefficient to take relatively large (in magnitude) values.</p>

<p>Formally:</p>

<p>yᵢ = β₀ + β_prod[prodᵢ] + β_ctry[countryᵢ] + α_prod[prodᵢ]·priceᵢ + α_ctry[countryᵢ]·priceᵢ<br />
pᵢ = sigmoid(yᵢ)<br />
statusᵢ ∼ Bernoulli(pᵢ)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  N=len(train_status)
  dim1 = len(set(product_id))
  dim2 = len(set(country))  

  with pm.Model() as shared_data_model: 

      intercept = pm.Normal('intercept', mu=0, sd=1)  

      alpha_product = pm.Normal('alpha_product', mu=0, sd=1, shape=dim1)
      alpha_country = pm.Normal('alpha_country', mu=0, sd=1, shape=dim2) 

      sigma_beta = 10
      beta_product = pm.Normal('beta_product', mu=0, sd=sigma_beta, shape=dim1)
      beta_country = pm.Normal('beta_country', mu=0, sd=sigma_beta, shape=dim2)  

      train_cty = pm.Data("train_cty", train_country)
      train_p = pm.Data("train_p", train_product_id)
      train_offers = pm.Data("train_offers", train_normalized_offers)
      train_p_cty = pm.Data("train_p_cty", train_p_cty)

      p = invlogit(intercept 
                   + alpha_product[train_p] 
                   + alpha_country[train_cty]    
                   + beta_product[train_p] * train_offers 
                   + beta_country[train_cty] * train_offers   
                  ) 

      y = pm.Bernoulli('y', p=p, observed=train_status) 

      trace = pm.sample(init='advi+adapt_diag', n_init=100000,
                            tune=1000, draws=1500, chains=3, cores=8,
                            target_accept=0.90, max_treedepth=10)
  az.plot_trace(trace, compact=True); plt.show()
</code></pre></div></div>

<p>This model is simple enough for this dataset, so there are no divergences or other diagnostics raising concerns.</p>

<figure>
  
<img src="/assets/images/bayesian_simulation/traceplot.png" alt="Foo" />
 
</figure>

<p>Let’s proceed.<br />
For simple models like this, even weaker priors would likely suffice. Adding extra terms, for example industry or time components, would increase complexity and make sampling harder.<br />
When the data are noisy and the signal is faint, encoding knowledge through priors can help the sampler converge and stabilize inference.</p>

<p>Sampling from the posterior, we see that price cleanly separates converted from lost opportunities.<br />
The highest posterior density regions show how conversion probability shifts with normalized price.<br />
Alternative transformations, such as z-scoring price by product, produced slightly cleaner regions, but since model performance was identical, keeping the price as-is was more convenient for the simulations — the main focus of this post.</p>

<figure>
  
<img src="/assets/images/bayesian_simulation/train_posterior.png" alt="Foo" />
 
</figure>

<p>The key takeaway from this model is that its predicted probabilities align well with the observed outcomes, which makes its predictive performance strong enough to support the simulations that follow.</p>

<p>Below we can see predictions on a hold-out set and the uncertainty of each, expressed as the standard deviation of their posterior predictive distributions.<br />
This helps gauge how much trust to place in each individual prediction.</p>

<p>For a Bernoulli variable, the posterior standard deviation is bounded by 0.5 — uncertainty is highest near p = 0.5 and decreases as probabilities approach 0 or 1.</p>
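<p>A quick numeric check of that bound: the standard deviation of a Bernoulli(p) variable is sqrt(p(1 - p)), which peaks at p = 0.5.</p>

```python
import math

def bernoulli_sd(p):
    """Standard deviation of a Bernoulli(p) variable: sqrt(p * (1 - p))."""
    return math.sqrt(p * (1.0 - p))

# Scan a fine grid of probabilities to locate the maximum.
grid = [i / 1000 for i in range(1001)]
p_max = max(grid, key=bernoulli_sd)   # lands on 0.5, where the sd is 0.5
```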

<figure>
  
<img src="/assets/images/bayesian_simulation/simul_prob_var_circle1.png" alt="Foo" />
 
</figure>

<p>In addition to the posterior prediction spread, we can also assess model uncertainty by inspecting the posterior distributions of its parameters.<br />
Parameters with wider posterior curves imply higher uncertainty, which naturally propagates into predictions.<br />
Listing their standard deviations is an intuitive way to see why some predictions carry more uncertainty than others.</p>

<h2 id="simulations--discounts-and-mark-ups">Simulations — Discounts and Mark-ups</h2>

<p>Linear models extrapolate reasonably well, though they assume a linear relation.<br />
That’s not always realistic — non-linear behavior appears when prices approach zero or climb far above the base level.</p>

<p>To explore what discount might turn a declined offer into a win, we can sample from the posterior at different price values.<br />
The plot below shows how conversions evolve with discounts; color encodes uncertainty (posterior standard deviation) across previously lost opportunities.<br />
Lower prices increase conversion probability, but products or countries with weaker signal-to-noise remain more uncertain.</p>
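<p>A stripped-down version of that simulation loop. It assumes the posterior has been reduced to draws of an intercept and a single price slope; the real model would index the draws by product and country.</p>

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conversion_vs_discount(posterior, rel_price, discounts):
    """Push each discounted price through every posterior draw and
    summarize the win probability (mean and standard deviation)."""
    curve = []
    for d in discounts:
        price = rel_price * (1.0 - d)
        probs = [sigmoid(b0 + b1 * price) for b0, b1 in posterior]
        mean = sum(probs) / len(probs)
        sd = math.sqrt(sum((q - mean) ** 2 for q in probs) / len(probs))
        curve.append((d, mean, sd))
    return curve

# Illustrative posterior draws: intercept near 2, price slope near -2.5.
rng = random.Random(42)
posterior = [(rng.gauss(2.0, 0.2), rng.gauss(-2.5, 0.3)) for _ in range(2000)]
curve = conversion_vs_discount(posterior, rel_price=1.1,
                               discounts=[0.0, 0.1, 0.2, 0.3])
```

<p>The mean win probability rises as the discount grows, while the per-point sd tracks how uncertain each prediction is.</p>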

<figure>
  
<img src="/assets/images/bayesian_simulation/discount.png" alt="Foo" />
 
</figure>

<p>From the simulations, a discount of roughly <strong>20%</strong> is enough to recover nearly all lost opportunities — beyond that, additional cuts bring little gain and simply erode margin.</p>

<p>The same approach applies to price increases: we can examine how higher unit prices trade off revenue versus conversion loss.<br />
The figure below shows the decline in wins as prices rise — useful for identifying thresholds just below where business begins to drop.</p>

<figure>
  
<img src="/assets/images/bayesian_simulation/mark-up.png" alt="Foo" />
 
</figure>

<p>Both simulations behave as expected: lower prices drive conversions, higher ones reduce them, confirming the model’s internal consistency and our intuition about the problem.</p>

<hr />

<p>This post hopefully helped illustrate how we can use models to assist in simulating scenarios.<br />
More complex models will bring very interesting simulations, and optimizing these parameter landscapes will become a less trivial exercise.</p>

<p><a href="https://www.testingbranch.com/src_model_simulation/">Check the code and adjust noise parameters to explore different scenarios</a></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="bayesian" /><category term="simulation" /><category term="decision-making" /><category term="uncertainty" /><summary type="html"><![CDATA[2023-06-09 — Bayesian decision making applied to sales opportunities]]></summary></entry><entry><title type="html">Noise, Stability, and Calibration</title><link href="https://www.testingbranch.com/noise_study/" rel="alternate" type="text/html" title="Noise, Stability, and Calibration" /><published>2021-05-01T00:00:00+00:00</published><updated>2025-10-25T00:00:00+00:00</updated><id>https://www.testingbranch.com/noise_study</id><content type="html" xml:base="https://www.testingbranch.com/noise_study/"><![CDATA[<h2 id="why-study-model-calibration-under-noisy-data">Why study model calibration under noisy data?</h2>

<p>A few years ago, <strong>Claudia Perlich</strong> wrote on <a href="https://www.quora.com/What-are-some-of-the-biggest-misconceptions-about-data-science/answer/Claudia-Perlich">Quora</a> that <em>“linear models are surprisingly resilient to noisy data.”</em><br />
That line stuck with me because it contradicts the common instinct to reach for deeper or more powerful models when the data gets messy.</p>

<p>I wanted to revisit that claim, reproduce it in a small controlled setup, and then extend it a bit:<br />
What happens when we add <em>feature</em> noise instead of switching labels?<br />
And how does <strong>calibration</strong> (how well predicted probabilities align with reality) break down under both types of noise?</p>

<hr />

<h2 id="tldr">TL;DR</h2>

<ul>
  <li><strong>Linear models degrade gracefully</strong> when noise increases; their bias acts as regularization.</li>
  <li><strong>Tree ensembles hold AUC longer</strong> under moderate feature noise, but their <strong>calibration collapses faster</strong>.</li>
  <li>Once <strong>labels</strong> are corrupted, <em>no model survives</em>: information is lost, not just hidden.</li>
  <li>Calibration helps, but only while the underlying signal still exists.</li>
</ul>

<hr />

<h2 id="approach">Approach</h2>

<p>The idea was to simulate a clean, linearly separable world and then contaminate it in a controlled way.</p>

<ul>
  <li><strong>Data</strong>: 10 features, 5 informative, synthetic binary target generated with the <code class="language-plaintext highlighter-rouge">make_classification</code> function from sklearn.</li>
  <li><strong>Noise</strong>:
    <ul>
      <li><em>Label noise</em>: randomly flipping 0↔1 with probability <em>p</em>.</li>
      <li><em>Feature noise</em>: adding Gaussian or Laplace perturbations, scaled to each feature’s standard deviation.</li>
    </ul>
  </li>
  <li><strong>Models</strong>:<br />
Logistic regression, Random Forest, and XGBoost, with and without isotonic calibration.</li>
  <li><strong>Metrics</strong>:<br />
AUC for discrimination; Expected Calibration Error (ECE) for reliability.</li>
</ul>

<p>Each configuration was run over multiple seeds and averaged, using up to 3 000 samples per run.</p>
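<p>The two contamination schemes can be sketched in a few lines. This is a plain-Python stand-in for clarity; the actual runs generated data with scikit-learn and perturbed arrays with NumPy.</p>

```python
import random
import statistics

def flip_labels(y, p, rng):
    """Label noise: flip each binary label 0 <-> 1 with probability p."""
    return [1 - yi if rng.random() < p else yi for yi in y]

def add_feature_noise(X, scale, rng):
    """Feature noise: add Gaussian perturbations to each column,
    scaled by that column's standard deviation."""
    cols = list(zip(*X))
    sds = [statistics.pstdev(col) for col in cols]
    return [[x + rng.gauss(0.0, scale * sd) for x, sd in zip(row, sds)]
            for row in X]
```

<p>Label noise destroys information outright, while feature noise merely blurs it, which is why the two degrade models so differently.</p>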

<hr />

<h2 id="results">Results</h2>

<p><img src="/assets/images/noise_study/summary_grid.png" alt="plots" /></p>

<p>At first glance, intuition is confirmed:</p>

<ul>
  <li>Under <strong>label noise</strong>, all models decay in lock-step. Logistic doesn’t collapse faster than the trees; they all converge toward randomness once the labels stop meaning anything.</li>
  <li>Under <strong>feature noise</strong>, the picture splits:
    <ul>
      <li>Logistic remains smooth and predictable. Its linear boundary blurs but doesn’t overreact (much).</li>
      <li>RF and XGB start to memorize noise, retaining slightly higher AUC for a while but paying for it in calibration error.</li>
      <li>Calibration (the dashed lines) restores some sanity, but only when the signal is still recoverable.</li>
    </ul>
  </li>
</ul>

<p>The curves are remarkably smooth, with no weird bumps, no instability.<br />
Simple models with strong inductive bias prefer signal over noise.</p>
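<p>For reference, the ECE used above can be computed for the positive class with simple equal-width binning. This is a minimal version, not necessarily the exact implementation behind the plots.</p>

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by predicted probability, then average
    |observed rate - mean predicted probability| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 joins the last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)     # mean predicted probability
        acc = sum(y for _, y in b) / len(b)      # observed positive rate
        ece += (len(b) / n) * abs(acc - conf)
    return ece
```

<p>An overconfident model, one predicting 0.95 where only half the cases are positive, scores an ECE of 0.45 in that bin.</p>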

<hr />

<p>Why is the linear model so stable here?<br />
Because the underlying data was generated by a <strong>linear process</strong>. The logistic model has the right inductive bias — it assumes the true decision boundary is linear; so even as we inject random perturbations, it degrades gracefully.</p>

<p>Tree-based models are flexible enough to “explain” small fluctuations as structure. That flexibility becomes a liability under noise: they overfit spurious splits, yielding high confidence on wrong examples, which shows up as poor calibration.</p>

<p>In the real world, this pattern often repeats: if your features already capture the main signal, linear baselines are hard to beat on stability. Complexity rarely saves you from bad data.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>This small experiment validates Perlich’s observation and extends it slightly:<br />
noise doesn’t just make you wrong, it makes you confident in the wrong things.</p>

<p>Linear models trade expressive power for robustness.<br />
Tree ensembles fight noise longer, but they start lying about their certainty.</p>

<p><a href="https://www.testingbranch.com/src_noise_model/">Check the code and adjust noise distributions, switch datasets, try out different models. Have fun!</a></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="machine-learning" /><category term="noise" /><category term="calibration" /><summary type="html"><![CDATA[2021-05-01 — How models behave when data gets messy]]></summary></entry><entry><title type="html">Extending python with Go</title><link href="https://www.testingbranch.com/Extending_python_with_go/" rel="alternate" type="text/html" title="Extending python with Go" /><published>2021-04-03T00:00:00+00:00</published><updated>2021-04-03T00:00:00+00:00</updated><id>https://www.testingbranch.com/Extending_python_with_go</id><content type="html" xml:base="https://www.testingbranch.com/Extending_python_with_go/"><![CDATA[<p>This post is about extending python code with Go.<br />
Python’s ecosystem typically contains a great deal of what is needed, but for the cases when it doesn’t, or when some bespoke development is justified, Go might be worth looking into. For one, the language is simple and the compiler forces whatever code you generate to maintain some readability. <br />
After ad-hoc calls of Go code, structured calls of Go code, and alternatives like <a href="https://www.ardanlabs.com/blog/2020/07/extending-python-with-go.html">this</a> or <a href="https://medium.com/@andreastagi/extending-python-with-go-part-1-6d0c8bb7dd56">this</a>, Gopy seems simpler, or at least a bit more automated. <br />
Gopy generates (and compiles) a CPython extension module from a Go package. It’s well maintained for linux environments and has plenty of examples to learn from. Installing Go and Gopy is straightforward, and instructions are provided in <a href="https://github.com/go-python/gopy">Gopy’s</a> repository.</p>

<hr />

<p>To illustrate the process and expose the practical pitfalls of Gopy, let’s start with an implementation of a vantage-point tree from the <a href="https://github.com/gonum/gonum/blob/master/spatial/vptree/vptree.go">gonum project</a>.</p>

<p>First step: have Go code that you want to use in your python pipeline. This bit is an example from gonum that should be simple to follow: essentially, from a collection of places and a specific address, determine which places are within a certain distance, and display the five closest.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package vptree

import (
  "fmt"
  "log"
  "math"

  "gonum.org/v1/gonum/spatial/vptree"
)

func Example_accessiblePublicTransport() {
  // Construct a vp tree of train station locations
  // to identify accessible public transport for the
  // elderly.
  t, err := vptree.New(stations, 5, nil)
  if err != nil {
    log.Fatal(err)
  }

  // Residence.
  q := place{lat: 51.501476, lon: -0.140634}

  var keep vptree.Keeper

  // Find all stations within 0.75 of the residence.
  keep = vptree.NewDistKeeper(0.75)
  t.NearestSet(keep, q)

  fmt.Println(`Stations within 750 m of 51.501476N 0.140634W.`)
  for _, c := range keep.(*vptree.DistKeeper).Heap {
    p := c.Comparable.(place)
    fmt.Printf("%s: %0.3f km\n", p.name, p.Distance(q))
  }
  fmt.Println()

  // Find the five closest stations to the residence.
  keep = vptree.NewNKeeper(5)
  t.NearestSet(keep, q)

  fmt.Println(`5 closest stations to 51.501476N 0.140634W.`)
  for _, c := range keep.(*vptree.NKeeper).Heap {
    p := c.Comparable.(place)
    fmt.Printf("%s: %0.3f km\n", p.name, p.Distance(q))
  }
}

// stations is a list of railways stations.
var stations = []vptree.Comparable{
  place{name: "Bond Street", lat: 51.5142, lon: -0.1494},
  place{name: "Charing Cross", lat: 51.508, lon: -0.1247},
  place{name: "Covent Garden", lat: 51.5129, lon: -0.1243},
  place{name: "Embankment", lat: 51.5074, lon: -0.1223},
  place{name: "Green Park", lat: 51.5067, lon: -0.1428},
  place{name: "Hyde Park Corner", lat: 51.5027, lon: -0.1527},
  place{name: "Leicester Square", lat: 51.5113, lon: -0.1281},
  place{name: "Marble Arch", lat: 51.5136, lon: -0.1586},
  place{name: "Oxford Circus", lat: 51.515, lon: -0.1415},
  place{name: "Picadilly Circus", lat: 51.5098, lon: -0.1342},
  place{name: "Pimlico", lat: 51.4893, lon: -0.1334},
  place{name: "Sloane Square", lat: 51.4924, lon: -0.1565},
  place{name: "South Kensington", lat: 51.4941, lon: -0.1738},
  place{name: "St. James's Park", lat: 51.4994, lon: -0.1335},
  place{name: "Temple", lat: 51.5111, lon: -0.1141},
  place{name: "Tottenham Court Road", lat: 51.5165, lon: -0.131},
  place{name: "Vauxhall", lat: 51.4861, lon: -0.1253},
  place{name: "Victoria", lat: 51.4965, lon: -0.1447},
  place{name: "Waterloo", lat: 51.5036, lon: -0.1143},
  place{name: "Westminster", lat: 51.501, lon: -0.1254},
}

// place is a vptree.Comparable implementation.
type place struct {
  name     string
  lat, lon float64
}

// Distance returns the distance between the receiver and c.
func (p place) Distance(c vptree.Comparable) float64 {
  q := c.(place)
  return haversine(p.lat, p.lon, q.lat, q.lon)
}

// haversine returns the distance between two geographic coordinates.
func haversine(lat1, lon1, lat2, lon2 float64) float64 {
  const r = 6371 // km
  sdLat := math.Sin(radians(lat2-lat1) / 2)
  sdLon := math.Sin(radians(lon2-lon1) / 2)
  a := sdLat*sdLat + math.Cos(radians(lat1))*math.Cos(radians(lat2))*sdLon*sdLon
  d := 2 * r * math.Asin(math.Sqrt(a))
  return d // km
}

func radians(d float64) float64 {
  return d * math.Pi / 180
}   
</code></pre></div></div>

<p>This is a particularly good example of code to take. The definition of the distance can easily be changed to reflect similarity in words, geometric spaces or whatever rule that makes sense to classify as similar.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  gopy build -output=some/folder -vm=python3 path/to/go_pkg
</code></pre></div></div>

<p>This creates the shared library and other objects needed for the binding.</p>

<p>For the shared objects (.so), there is one extra step before interacting with the bindings: add the new path to the LD_LIBRARY_PATH variable, so the dynamic link loader knows where to search for the shared libraries. There is a long-standing <a href="https://github.com/go-python/gopy/issues/203">issue</a> where all the steps are described.<br />
If you’re in the location of the generated folder, add the current working directory ($PWD) to the environment variable; otherwise adjust it accordingly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD python3
</code></pre></div></div>

<p>After that, you’re free to import vptree and use it with little to no friction.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    &gt;&gt;&gt; import vptree
    &gt;&gt;&gt; vptree.Example_accessiblePublicTransport()
    Stations within 750 m of 51.501476N 0.140634W.
    St. James's Park: 0.545 km
    Green Park: 0.600 km
    Victoria: 0.621 km

    5 closest stations to 51.501476N 0.140634W.
    St. James's Park: 0.545 km
    Green Park: 0.600 km
    Victoria: 0.621 km
    Hyde Park Corner: 0.846 km
    Picadilly Circus: 1.027 km
</code></pre></div></div>

<hr />

<p>Simple enough.
This was a very simple example, but it seems to generalize well to more complex Go code.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="python" /><summary type="html"><![CDATA[2021-04-03 — A simple example of how GoPy can be used to extend python with Go native code]]></summary></entry><entry><title type="html">Subsampling as a strategy to find optimal parameters (2/2)</title><link href="https://www.testingbranch.com/optimization_sample_fusion/" rel="alternate" type="text/html" title="Subsampling as a strategy to find optimal parameters (2/2)" /><published>2021-03-22T00:00:00+00:00</published><updated>2021-03-22T00:00:00+00:00</updated><id>https://www.testingbranch.com/optimization_sample_fusion</id><content type="html" xml:base="https://www.testingbranch.com/optimization_sample_fusion/"><![CDATA[<p>This is the continuation to this <a href="https://www.testingbranch.com/parameter_optimization_subsampling/">post</a> where we explore the changes in the output of machine learning models when they are trained on samples of varying sizes.</p>

<hr />

<p>Some preliminary thoughts and conclusions from the last post:</p>
<ol>
  <li>The complexity of the models determines how profitable it is to explore at lower sample sizes. For quadratic algorithms, and ignoring the actual implementation, training one model with the full dataset should cost the same amount of time as training 6.25 models with 40% of that dataset.</li>
  <li>Below a threshold sampling percentage, the results of a model are not informative about the full dataset. Being greedy doesn’t help here.</li>
  <li>Overall, models trained on smaller samples seem to be noisier images of models trained on the full dataset.</li>
</ol>
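<p>The first point can be expressed as a one-line helper, assuming training cost scales as a pure power of the sample size (constants and log factors ignored):</p>

```python
def models_per_full_model(sample_frac, exponent):
    """How many subsampled trainings cost the same as one full-data training,
    assuming training cost scales as n ** exponent."""
    return 1.0 / (sample_frac ** exponent)

# For a quadratic algorithm at 40% of the data: 1 / 0.4**2 = 6.25 models.
```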

<hr />

<p>Let’s take this <a href="https://github.com/fmfn/BayesianOptimization/blob/master/examples/sklearn_example.py">example</a> from <a href="https://github.com/fmfn/BayesianOptimization">bayes_opt</a>. <br />
We want to optimize over a 3d space composed of random forest parameters (max_features, min_sample_split, trees) where the model is evaluated with a cross-validated negative log-loss score.<br />
A standard bayesian optimizer runs 100 models with a synthetically generated dataset with a binary target.</p>

<p>Let’s do a run in an exploratory mode: focusing on exploring the landscape instead of necessarily exploiting regions near local or global maxima.</p>

<p>The logs below print how many models were run during the bayesian optimization and the computational budget consumed. <br />
Each logged line shows the best combination of parameters found up to that iteration. The best model appeared at iteration 13; the remaining 87 iterations never produced a superior one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimizing Random Forest: 100 models; budget: 100 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3418   |  0.4902   |  0.02077  |  10.02    |
|  8        | -0.3349   |  0.6166   |  0.01651  |  10.02    |
|  13       | -0.2919   |  0.999    |  0.01     |  250.0    |
=============================================================
</code></pre></div></div>

<p>The figure below shows a fairly explored space, where some regions clearly seem to have more performant models (lighter tone in the color scale).</p>

<figure>
  
<img src="/assets/images/bayes_opt_variation/full_1.png" alt="Foo" />
 
</figure>

<p>Another run promoting a more balanced relation between exploration and exploitation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimizing Random Forest: 100 models; budget: 100
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3423   |  0.4902   |  0.02077  |  10.02    |
|  5        | -0.3404   |  0.999    |  0.01     |  10.0     |
|  6        | -0.3149   |  0.9771   |  0.01587  |  188.5    |
|  12       | -0.2959   |  0.999    |  0.01     |  96.31    |
|  16       | -0.2946   |  0.999    |  0.01     |  141.4    |
|  51       | -0.2943   |  0.999    |  0.01     |  122.7    |
|  95       | -0.2942   |  0.999    |  0.01     |  126.5    |
=============================================================
</code></pre></div></div>

<figure>
  
<img src="/assets/images/bayes_opt_variation/full_1_exploit.png" alt="Foo" />
 
</figure>

<p>Both strategies seem effective at exploring the parameter space. Let’s add subsampling.</p>

<hr />

<p>To benchmark all the variants of the exploration we fix the same compute budget, that is, the budget needed to run 100 models with the full dataset.<br />
Let’s treat the compute time of the gaussian process that fits the hyperparameter space as negligible, even though it isn’t: its cost matters 1) as observations accumulate; 2) as the hyperparameter space grows; 3) whenever the model function is cheap to compute.</p>

<p>Some remarks:</p>
<ol>
  <li>The computational budget is divided between different sample sizes, and for each size the complexity relation tells us how many models the budget affords. Random forests are assumed to be log-linear; SVMs are assumed to be quadratic. Abstracting some implementation details is acceptable to get started.</li>
  <li>The lowest sampling percentage is fixed and should be tuned to the complexity of the data; 1% of the data may already be enough to learn the target.</li>
  <li>How many sample sizes should we explore? Not enough has been explored here, but it again seems contingent on the data.</li>
  <li>What strategy should divide the budget: evenly over the various sample sizes, or something more elaborate?</li>
</ol>

<hr />

<p>The key concern, once the exploration in a sample is completed, is how to carry the information gathered forward to the next (larger) sample exploration.<br />
A simple step is to pass promising points to probe. Adjusting the domain of the parameters so as not to explore flat areas is promising but not easy to implement in bayes_opt. Another simple way to pass information is to copy the posterior (after fitting to the observations) covariance function of the underlying gaussian process and use it as a prior for the optimization of the following sample.</p>
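<p>The first hand-off, probing promising points, can be sketched as below. The (score, params) history format is a hypothetical stand-in for whatever records the previous stage produced; with bayes_opt, the selected points would then be queued with <code>optimizer.probe</code>.</p>

```python
def top_points(history, k=5):
    """Pick the k best parameter dicts from a finished stage, best first.

    `history` is assumed to be a list of (score, params_dict) pairs
    collected during one sample-size exploration.
    """
    ranked = sorted(history, key=lambda item: item[0], reverse=True)
    return [params for _, params in ranked[:k]]

# With bayes_opt, the next (larger-sample) optimizer could then do:
#   for params in top_points(history):
#       optimizer.probe(params=params, lazy=True)
```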

<p>Some ideas to make subsampled bayesian optimization more clever were: 1) to make the exploration strategy depend on the sample size (exploring at lower samples and exploiting at higher ones seems a good heuristic); 2) to add a sample-size-dependent noise term to the gaussian process, to model the higher variance at lower samples.</p>

<hr />

<p>The logs and figures below show the result of a three-way sample-size split, with sample percentages [30%, 70%, 100%], where the computational budget (the equivalent of training 100 models with the full dataset) was split evenly. <br />
Notice that the same budget allows for a very different number of models at each sample size. Allocating more budget to lower or higher percentages could ease the exploration of more complex parameter spaces.</p>

<p>Sample percentage:30%</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimizing Random Forest: 148 models; budget: 33 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3858   |  0.4902   |  0.02077  |  10.02    |
|  9        | -0.3753   |  0.9791   |  0.01979  |  23.29    |
|  19       | -0.3539   |  0.6998   |  0.02344  |  241.5    |
|  43       | -0.3354   |  0.999    |  0.01     |  221.5    |
|  112      | -0.3344   |  0.999    |  0.01     |  46.79    |
|  137      | -0.3343   |  0.999    |  0.01     |  40.97    |
=============================================================
</code></pre></div></div>

<figure>
  
<img src="/assets/images/bayes_opt_variation/0_3.png" alt="Foo" />
 
</figure>

<p>Sample percentage:70%</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimizing Random Forest: 57 models; budget: 33
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  3        | -0.3023   |  0.999    |  0.01     |  114.4    |
|  5        | -0.302    |  0.999    |  0.01     |  234.8    |
|  42       | -0.3006   |  0.7511   |  0.0109   |  67.98    |
=============================================================
</code></pre></div></div>

<figure>
  
<img src="/assets/images/bayes_opt_variation/0_7.png" alt="Foo" />
 
</figure>

<p>Sample percentage:100%</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimizing Random Forest: 33 models; budget: 33 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  32       | -0.2917   |  0.7301   |  0.01     |  250.0    |
=============================================================
</code></pre></div></div>

<figure>
  
<img src="/assets/images/bayes_opt_variation/1.png" alt="Foo" />
 
</figure>

<hr />

<p>To conclude, this was not an exhaustive study of how subsampling can be used to find optimal points in the hyperparameter space. <br />
The code below can be changed to:</p>
<ol>
  <li>Explore different budget-dividing strategies;</li>
  <li>Explore different amounts of noise at each of the sample percentages;</li>
  <li>Investigate how to leverage the explore vs exploit trade-off;</li>
  <li>Explore the relation between the observed points and the number of points to carry across samples;</li>
  <li>Explore how different these points should be; <br />
…</li>
</ol>
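<p>As a rough starting point for the first item, a budget split can be sketched as below. The linear cost model (training cost proportional to the sample fraction) is an assumption; the runs shown above used a different allocation (57 models at 70%), so this is only illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed cost model: training cost scales linearly with the sample
# fraction, so a fixed budget buys more models at smaller fractions.
def models_per_fraction(budget, fractions):
    return {f: int(budget / f) for f in fractions}

allocation = models_per_fraction(33, [0.3, 0.7, 1.0])
# at 100% the budget buys exactly 33 models; smaller fractions buy more
</code></pre></div></div>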

<p>Exploring higher-dimensional spaces at smaller sample sizes seems at least more effective than relying on a few very expensive trainings on the full dataset. For models with higher training complexity, this advantage should be even more evident.</p>

<hr />

<p><a href="https://www.testingbranch.com/bayes_opt_subsampled/">Code</a></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="machine-learning" /><category term="optimization" /><summary type="html"><![CDATA[2021-03-22 — Probing and fusing parameter space explorations]]></summary></entry><entry><title type="html">Identifying outliers in time series</title><link href="https://www.testingbranch.com/outliers_time_series/" rel="alternate" type="text/html" title="Identifying outliers in time series" /><published>2021-02-20T00:00:00+00:00</published><updated>2025-10-30T00:00:00+00:00</updated><id>https://www.testingbranch.com/outliers_time_series</id><content type="html" xml:base="https://www.testingbranch.com/outliers_time_series/"><![CDATA[<h2 id="modeling-time-series-outliers-with-gaussian-processes">Modeling time series outliers with Gaussian Processes</h2>

<p>This is essentially a back-of-the-envelope study on identifying outliers in time series. The idea is to sketch a method that ties model quality to the presence of outliers.<br />
When dealing with data that is not ordered in time, finding outliers is hard, but even simple approaches might yield decent results. Setting a percentile threshold that determines what counts as an outlier works nicely in one-dimensional data, and might even be useful in low-dimensional data. To add some support to whatever threshold you decide on, a Bonferroni outlier test can be run as a check.<br />
For time series, that is evidently not a satisfying answer. Even very rare values can be periodic; this is in fact a common pattern.</p>

<p>One definition of an outlier is a measurement that does not fit the data-generating process. <br />
Given a sufficient number of samples, the signal makes itself clear, even in the presence of <em>significant</em> noise. Let’s focus on the problem with a small number of data points: something like a monthly series, a frequent business scenario.</p>

<hr />

<p>Let’s create a small series which has the following decomposition:</p>

<figure>
  
<img src="/assets/images/outliers_ts/signal_decomposition.png" alt="Signal decomposition of the synthetic series" />
 
</figure>

<p>We need to create a generating model, which will be critical to evaluate the likelihood of a point being an outlier. Let’s not use the entirety of our knowledge of the series.
For a series this simple, we’ll use Gaussian process regression, and we’ll define the covariance function as the sum of a Matérn 5/2 kernel and a periodic kernel of period 12. We also define a linear mean function, with the slope given by a random variable that is inferred using MCMC.<br />
This is a plausible injection of basic yet <em>informative</em> priors into the model. More complex series may require more complex models, which are harder to sample from and would make the idea of the post harder to illustrate.</p>
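<p>A minimal sketch of such a model with scikit-learn’s GP regression on a synthetic monthly series is below. The data, kernel settings, and noise level are illustrative assumptions, and scikit-learn fits kernel hyperparameters by marginal-likelihood optimization rather than inferring a mean-function slope with MCMC as described above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, Matern, WhiteKernel

# illustrative monthly series: linear trend plus yearly seasonality plus noise
rng = np.random.default_rng(0)
t = np.arange(48.0).reshape(-1, 1)
y = 0.05 * t.ravel() + np.sin(2 * np.pi * t.ravel() / 12) + rng.normal(0, 0.1, 48)

# covariance: Matern 5/2 plus a periodic kernel of period 12, as in the text;
# the white kernel absorbs observation noise
kernel = Matern(nu=2.5) + ExpSineSquared(periodicity=12.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)
mean, std = gp.predict(t, return_std=True)
</code></pre></div></div>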

<hr />

<p>A quick test, performing a 12-month forecast, shows the model capturing the signal pretty well when there is no noise; as the noise magnitude increases, the predictive ability decreases, as expected.</p>

<figure>
  
<img src="/assets/images/outliers_ts/forecast_snr.gif" alt="12-month forecasts at increasing noise levels" />
 
</figure>

<hr />

<p>One feature that I particularly enjoy in Gaussian processes is their ability to interpolate data very nicely, allowing imputation of missing values in a principled way. Because we can draw samples from the model, we can generate a distribution for each of the points in the series: we exclude each point in turn and, assuming the model is sufficiently well defined, collect the percentile of the held-out value under that distribution.<br />
Now, the outcome of this step may seem redundant, or a symptom of a weak model or a hard-to-model problem. However, the main objective is to show that the absence of outliers produces a superior model, one that reduces the modelling error significantly for the rest of the series.</p>
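<p>The leave-one-out step can be sketched as follows, with illustrative data and fixed kernel hyperparameters (<code>optimizer=None</code>) so the repeated refits stay cheap; <code>sample_y</code> draws from the refit posterior at the held-out location.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, Matern, WhiteKernel

rng = np.random.default_rng(0)
t = np.arange(36.0).reshape(-1, 1)
y = np.sin(2 * np.pi * t.ravel() / 12) + rng.normal(0, 0.1, 36)

kernel = Matern(nu=2.5) + ExpSineSquared(periodicity=12.0) + WhiteKernel(noise_level=0.01)

# leave each point out, refit, and record the percentile of the held-out
# value under posterior draws at that location; optimizer=None keeps the
# kernel hyperparameters fixed so the 36 refits stay cheap
def loo_percentiles(t, y, n_draws=200):
    percentiles = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, optimizer=None)
        gp.fit(t[mask], y[mask])
        draws = gp.sample_y(t[i:i + 1], n_samples=n_draws, random_state=0).ravel()
        percentiles.append(float(np.mean(np.less(draws, y[i]))))
    return np.array(percentiles)

scores = loo_percentiles(t, y)  # values near 0 or 1 flag candidate outliers
</code></pre></div></div>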

<p>This extra iteration over the entire series adds a significant amount of computing time to an already expensive method; but this is a small series, and a few seconds per model is nothing obscene.<br />
Below we see the process for the series with a minimal amount of noise.</p>

<figure>
  
<img src="/assets/images/outliers_ts/interp.gif" alt="Leave-one-out interpolation over the series" />
 
</figure>

<p>Even with a model that’s as close to naive as possible, the signal is picked up, and for the most part the mean of the samples matches the missing value.</p>

<hr />

<p>The key idea behind this post is the superior model obtained when outliers are removed from the time series, something made clear in the next animation. When the outlier is removed, the variance of the model is very low compared to the models generated when it is kept, and most importantly, the mean of the posterior samples resembles the original time series.</p>

<figure>
  
<img src="/assets/images/outliers_ts/interp_1_outlier.gif" alt="Interpolation with a single outlier" />
 
</figure>

<p>With more than one outlier, the problem stops being trivial; removing just one of the outliers is no longer sufficient for the model to obtain a clear signal. Depending on the magnitude of the remaining outlier, the perturbation it adds makes any modelling quite hard, forcing the removal of another data point.</p>

<figure>
  
<img src="/assets/images/outliers_ts/interp_2_outlier_2.gif" alt="Interpolation with two outliers" />
 
</figure>

<p>We can see that the model seems to improve when any of the outliers is removed; measuring this improvement might be sufficient to list outlier candidates, and then to remove combinations of elements of this list to find the most promising set.<br />
It’s still hard to produce a heuristic that generalizes to most outliers. Any value with a large enough offset from the original time series is easily detectable, but once it is removed, the search requires an additional loop over the remaining points; even with a small amount of data, this gets prohibitive.</p>
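<p>The candidate-set search can be sketched like this, with illustrative data, two planted outliers, and a cheap polynomial fit standing in for the GP to keep the loop fast:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from itertools import combinations

# illustrative series with two planted outliers at positions 7 and 20
rng = np.random.default_rng(1)
t = np.arange(36.0)
y = np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.05, 36)
y[7] += 3.0
y[20] -= 2.5

# remove each combination of candidate points, refit a stand-in model
# (a polynomial instead of the GP, for speed), and keep the set that
# minimizes the fit error on the remaining points
def best_outlier_set(t, y, candidates, max_size=2):
    xs = t / t.max()  # rescale for a well-conditioned polynomial fit
    best, best_err = (), np.inf
    for k in range(1, max_size + 1):
        for subset in combinations(candidates, k):
            mask = np.ones(len(y), dtype=bool)
            mask[list(subset)] = False
            coef = np.polyfit(xs[mask], y[mask], 7)
            err = float(np.mean((np.polyval(coef, xs[mask]) - y[mask]) ** 2))
            if best_err > err:
                best, best_err = subset, err
    return best

best = best_outlier_set(t, y, candidates=[3, 7, 20])
</code></pre></div></div>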

<hr />

<p>The main conclusion of this post is that outliers are hard to evaluate, except in specific cases.<br />
Here, knowledge of the system was needed to build a <em>good enough</em> model; knowledge of the amount of noise that influences the process, and of how many outliers are in the series, was also essential. In a real-world scenario this is not the case, but the approach may still serve as an exploratory step.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="time-series" /><category term="gaussian-processes" /><category term="outliers" /><category term="anomaly-detection" /><category term="bayesian" /><summary type="html"><![CDATA[2021-02-20 — Observing the impact of outliers in small time series using gaussian processes]]></summary></entry><entry><title type="html">Subsampling as a strategy to find optimal parameters (1/2)</title><link href="https://www.testingbranch.com/parameter_optimization_subsampling/" rel="alternate" type="text/html" title="Subsampling as a strategy to find optimal parameters (1/2)" /><published>2021-02-14T00:00:00+00:00</published><updated>2021-02-14T00:00:00+00:00</updated><id>https://www.testingbranch.com/parameter_optimization_subsampling</id><content type="html" xml:base="https://www.testingbranch.com/parameter_optimization_subsampling/"><![CDATA[<p>This is the first of two posts about finding optimal parameters for machine learning models, and is motivated by 
<a href="https://arxiv.org/abs/2003.05689">Hyper-Parameter Optimization: A Review of Algorithms and Applications</a>. Subsampling is discussed in a later section of that review as a strategy to reduce training time, and, in doing so, to reduce the search time necessary to find optimal values. It is stated that subsampling is risky in terms of its potential to introduce more noise and uncertainty. <br />
In this first post I’m going to explore subsampling and how much can be learned about the parameter space from it: try to find observational support for the idea that parameters tend to converge to the same values as training data increases, and check whether they do so at the same rate for all parameters; observe the effect on combinations of parameters; and see whether the overall pattern generalizes in the same way to different methods.<br />
In the <a href="https://www.testingbranch.com/optimization_sample_fusion/">second post</a> I’ll use subsampling to explore the parameter space efficiently, with whatever computational budget there is.</p>

<hr />

<p>Predictive models are able to generalize what they learn from data, otherwise they’re not very useful; but there is a minimum amount of data below which the model simply cannot learn.</p>

<p>Let’s then pick the Random Forest implementation from sklearn, and three parameters as a start. The data is the Boston housing dataset: around 500 records, a nice size for running some grid searches in these initial tests. I’ll track the R² score obtained from cross-validation.</p>

<p>What we expect to find is evidence that smaller samples are still informative about the optimal parameters for the full dataset.<br />
For each size and combination of parameters, we randomly sample from the original dataset, while keeping the random seed of the algorithm fixed to maintain some amount of determinism.<br />
The plots below show the grid-search results for each parameter. Keep the colour scale in mind; it is used for the rest of the plots.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/one_param.png" alt="Grid-search scores for single parameters across sample sizes" />
 
</figure>
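<p>The subsampling loop behind these explorations can be sketched as below. Synthetic data stands in for the Boston housing set (which has been removed from recent scikit-learn releases), and the fractions and parameters are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# stand-in dataset of roughly the same size as the post uses (about 500 records)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(42)

# score one parameter setting on a random subsample of the data
def score_at_fraction(frac, **params):
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    model = RandomForestRegressor(n_estimators=50, random_state=0, **params)
    return cross_val_score(model, X[idx], y[idx], scoring="r2", cv=3).mean()

scores = {f: score_at_fraction(f, min_samples_split=4) for f in (0.2, 0.5, 1.0)}
</code></pre></div></div>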

<p>Some comments:</p>
<ol>
  <li>Less data means less to learn: performance must necessarily be lower with small samples.</li>
  <li>Less data means greater variation in how the training data can be generated; some samples can be very unrepresentative of the dataset. This needs to be explored further.</li>
  <li>For some parameters, such as max features, subsampling seems to have a greater impact on the predictive performance of the model.</li>
</ol>

<p>As expected, the parameter curves seem to share many features across sample sizes, even at the smallest ones, and as the size increases the convergence becomes even more apparent. To put it simply, the optimal parameters are quite similar beyond a certain size.</p>

<hr />

<p>Let’s explore the variation of the scores a bit further.<br />
Let’s focus on one parameter, min_samples_split, and repeat the exploration a few times.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/variance_mss.png" alt="Repeated score curves for min_samples_split" />
 
</figure>

<p>And we can see the mean value and the two-standard-deviation band.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/mean_and_sd_.png" alt="Mean score and two-standard-deviation band" />
 
</figure>

<p>Smaller samples have a greater variance than larger samples; there are certainly odd combinations, in particular for a dataset as small as this one.<br />
Larger samples have a significant overlap with the original dataset, which means there is less room for the training data to vary, and as expected, the results are much less scattered.<br />
In my implementation, even when the sample size matches the entire dataset, the record order is not fixed by a random seed; this adds additional variation, which is not only acceptable but actually desirable, both to explore what causes different outcomes in these methods and to get an idea of the magnitude of the variation.</p>

<hr />

<p>Before trying to confirm that the same patterns hold in other scenarios, let’s take a minute to explore, fixing one parameter, the distributions of the scores at each sample size. The reason is to start observing the actual dynamics of these distributions as more data gets fed to the model.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/distributions.gif" alt="Score distributions at each sample size" />
 
</figure>

<p>Below we can see that, if we exclude very low sample sizes, there is a very significant overlap in the R² score distributions. In other words, there is plenty of useful information at lower sample sizes.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/distributions_0.gif" alt="Score distributions excluding the smallest sample sizes" />
 
</figure>

<p>One interesting aspect of these repeated draws at a specific sample size is the bell shape of the distribution. Even with ceiling effects, normality tests pass, and we can at least use this information to model the behaviour in the second post.</p>

<hr />

<p>One continuation of this exploration concerns the relation between sample size and more than one parameter. <br />
We have two examples of parameter combinations that support what we saw previously: the surfaces share plenty of similarities across sample sizes, and most importantly, the parameters that maximize the R² score are at the very least neighbours and at the very best exactly the same.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/min_min.gif" alt="Score surface for a pair of parameters across sample sizes" />
 
</figure>

<figure>
  
<img src="/assets/images/hyperparam_sampling/min_max.gif" alt="Score surface for a second parameter pair across sample sizes" />
 
</figure>

<hr />

<p>One comment regarding how different datasets affect what we have observed: there seems to be a point in the sample size after which the dataset is informative enough for the model to pick up patterns. Rich datasets are a bit more demanding in how small samples can be; this adds another layer of variance, which is ignored for now.</p>

<hr />

<p>One final exploration, with an SVM, is carried out. There is no scaling of the features or anything remotely trying to maximize performance; again, the point is to see that there are no dramatic changes in the model’s behaviour.</p>

<figure>
  
<img src="/assets/images/hyperparam_sampling/svm.gif" alt="SVM score surface across sample sizes" />
 
</figure>
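<p>As a rough illustration of why subsampling pays off for the SVM, a back-of-the-envelope cost model is below; the quadratic exponent and the 38% second-smallest fraction are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># assumed cost model: SVM training time grows roughly with n squared
def relative_cost(fraction, exponent=2.0):
    return fraction ** exponent

# with an assumed second-smallest fraction of about 38% of the data,
# one full-data model costs about as much as seven subsampled ones
ratio = relative_cost(1.0) / relative_cost(0.38)
print(round(ratio))  # prints 7
</code></pre></div></div>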

<p>We can see that for the very smallest sample size, the maximum seems to be contested by two distinct locations in the parameter space. Small samples are troublesome, as we’ve seen, and different algorithms scale differently.<br />
The second-smallest sample already seems to converge to the optimal set of parameters, and at this point we can make some comments that motivate the next post.<br />
Our first method was the random forest, which scales nicely, log-linearly. SVMs, on the other hand, have quadratic time complexity, and this is one aspect that motivates subsampling in this exploration.<br />
Computing one model on the entire dataset costs us the same as training roughly seven at the second-smallest sample size: there is intrinsic noise that comes with small samples, yet we can explore a lot more.<br />
A structured, yet simple, approach to how this exploration can be made, leveraging the trade-off between information gained and reduced computational load, is the content of the next post.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;nil, &quot;email&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Email&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-envelope-square&quot;, &quot;url&quot;=&gt;&quot;mailto:miguelcbatista@gmail.com&quot;}, {&quot;label&quot;=&gt;&quot;Twitter&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-twitter-square&quot;, &quot;url&quot;=&gt;&quot;https://twitter.com/mpcbatista&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;}, {&quot;label&quot;=&gt;&quot;Linkedin&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-linkedin&quot;}]}</name></author><category term="optimization" /><category term="machine-learning" /><summary type="html"><![CDATA[2021-02-14 — Using smaller samples to find optimal parameters for machine learning models]]></summary></entry></feed>