Obsidic MLB Analytics

Our Methodology

Obsidic projects every MLB game through a system of interconnected models, from individual pitch physics to full nine-inning simulations. Here's a comprehensive look at how the engine works and why we built it this way.


01

Philosophy

Baseball is uniquely suited to quantitative modeling. Unlike free-flowing sports where dozens of players interact simultaneously, a baseball game decomposes into discrete events: one pitcher throws to one batter, and the outcome of that confrontation depends on a relatively contained set of factors. This structure is a gift to anyone trying to predict what will happen next.

But "relatively contained" doesn't mean simple. The outcome of a plate appearance is shaped by the pitcher's arsenal, the batter's tendencies, the count, the park dimensions, the day's weather, the umpire behind the plate, who's on base, and a long tail of subtler forces: fatigue, game state pressure, lineup protection, and pure randomness. Any model that collapses this complexity into a single formula is leaving information on the table.

Obsidic's approach is to model each of these layers separately and then let them interact naturally through simulation. Rather than training one model to predict "who wins," we build a system of specialized models, each responsible for a narrow slice of the game, and connect them through thousands of simulated plate appearances that mirror how a real game unfolds.

Design Principle The projection should emerge from the simulation, not precede it. We don't predict a final score and work backward; we simulate every at-bat and let the results accumulate.
Training Data Scale
Pitches tracked: 3.4M+
Plate appearances: 1M+
Games backtested: 12,150
Player props scored: 215,000

02

The Projection Pipeline

Each game passes through a sequence of stages before projections are published. The pipeline runs daily and adapts to the information available: early projections use expected lineups, and final projections lock in once official lineups are posted.

📊 Profiles Player ability models updated with latest data
⚔️ Matchups Batter-pitcher probability distributions
🌤️ Environment Park, weather, and umpire adjustments
🎲 Simulation Thousands of full nine-inning games

The output isn't just a single number; it's a distribution. For every game, we produce win probabilities, projected run totals with standard deviations, over/under probabilities at multiple lines, first-inning scoring rates, first-five-inning projections, and individual player prop distributions. All of these emerge from the same simulation rather than from separate models, which means they're internally consistent.


03

Player Ability Models

At the foundation of everything are the player profiles. These are statistical representations of what each batter and pitcher is likely to do in a given plate appearance, built from pitch-level Statcast data spanning multiple seasons. The training set includes millions of individual pitches from 2021 through 2025, with full physical tracking: velocity, movement, spin, and location, along with the outcomes they produced.

Separating Skill from Noise

A central challenge in baseball modeling is distinguishing a player's true ability from the noise of small samples. A batter who goes 0-for-20 hasn't necessarily declined; a pitcher with a 1.50 ERA in April isn't necessarily an ace. Surface-level stats are heavily contaminated by luck, sequencing, and defensive alignment.

Our profiles use machine learning to estimate ability from the physical characteristics of each event: how hard the ball was hit, at what angle, and with what trajectory, rather than relying solely on whether the outcome happened to be a hit or an out. This means our model can identify when a player's results are running ahead of or behind their underlying quality. A batter making consistently hard contact who's batting .220 will be projected more favorably than a .280 hitter surviving on soft contact and fortunate placement.

These expected performance metrics are derived from batted-ball physics rather than box-score outcomes. They are computed for every batter and pitcher and persisted in their profiles. They serve as a reality check throughout the pipeline, anchoring projections to contact quality rather than results that may be inflated or deflated by luck.
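To make this concrete, here is a minimal sketch of a contact-quality model: a classifier trained on exit velocity and launch angle that assigns every batted ball an expected hit probability, independent of the actual result. The synthetic data, feature set, and model choice are illustrative assumptions, not Obsidic's production stack.

```python
# Minimal sketch: estimate "expected" hit probability from contact quality alone.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy batted-ball sample: exit velocity (mph), launch angle (degrees), and outcome.
exit_velo = rng.normal(88, 8, 5_000)
launch_angle = rng.normal(12, 15, 5_000)
# Synthetic labels: hard contact at line-drive angles is most likely to fall for a hit.
p_hit = 1 / (1 + np.exp(-(0.08 * (exit_velo - 95) - 0.002 * (launch_angle - 15) ** 2 + 0.5)))
is_hit = rng.random(5_000) < p_hit

X = np.column_stack([exit_velo, launch_angle])
model = GradientBoostingClassifier().fit(X, is_hit)

# Expected hit value for each batted ball, regardless of whether it actually fell in.
expected_hit = model.predict_proba(X)[:, 1]

# A player's expected rate is the average over his batted balls; comparing it to his
# actual rate shows whether results are running ahead of or behind his contact quality.
player_rows = slice(0, 60)  # pretend these rows belong to one batter
print("actual hit rate:  ", round(float(is_hit[player_rows].mean()), 3))
print("expected hit rate:", round(float(expected_hit[player_rows].mean()), 3))
```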

Handling Limited Data

Not every player has a deep track record. Call-ups, mid-season acquisitions, and players returning from injury may have only a few hundred pitches of usable data. Projecting these players requires care: being too aggressive with thin samples is worse than being conservative.

Obsidic uses a Bayesian-inspired regression approach for thin samples: players with limited data are blended toward population-level baselines at a rate proportional to their sample size. As more data accumulates, the player's individual signal gradually overtakes the prior. The threshold and blending curve were determined empirically through backtesting rather than chosen by intuition.
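A minimal sketch of this kind of shrinkage follows, assuming a simple sample-size weighting with the roughly 100 PA crossover shown in the chart below; the exact functional form and crossover Obsidic uses were tuned through backtesting and are not reproduced here.

```python
def blended_rate(player_rate: float, n_pa: int, league_rate: float,
                 crossover_pa: float = 100.0) -> float:
    """Blend a player's observed rate toward a population baseline.

    The weight on the individual signal grows with sample size; at the
    crossover (assumed ~100 PA here) the individual signal and the baseline
    contribute equally. The curve is illustrative, not the tuned production value.
    """
    w = n_pa / (n_pa + crossover_pa)          # 0 at 0 PA, approaches 1 as PA grows
    return w * player_rate + (1 - w) * league_rate

# A call-up hitting .350 over 40 PA is pulled most of the way back to league average.
print(round(blended_rate(0.350, 40, 0.248), 3))   # ~0.277
# A regular hitting .350 over 400 PA keeps most of his individual signal.
print(round(blended_rate(0.350, 400, 0.248), 3))  # ~0.330
```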

Why This Matters Early in the season, when most models struggle, this regression framework prevents the kind of wild projections that come from taking two-week stat lines at face value. By mid-season, profiles stabilize and the regression fades into the background.
Sample Size Blending — Individual Signal vs Population Baseline
[Chart: the weight on a player's individual signal rises from 0% toward 100% as plate appearances grow from 0 to 400+, crossing the population baseline at roughly 100 PA.]
Early season: heavy regression to baseline. Mid-season: individual performance dominates.

04

The Matchup Engine

Baseball is one of the few sports where two players directly confront each other on every play. The matchup engine takes a specific batter and a specific pitcher and produces a probability distribution across all possible plate appearance outcomes: strikeout, walk, single, double, triple, home run, hit by pitch, and the various categories of outs.

This isn't a simple average of the batter's rates and the pitcher's rates. A dedicated machine learning model ingests both player profiles simultaneously and learns the non-linear interactions between them: how specific types of hitters perform against specific types of pitchers. A batter who struggles with high-velocity fastballs will be projected differently against a power arm than against a soft-tossing control artist, even if both pitchers have similar overall stat lines. The interaction between batter and pitcher characteristics is where much of the predictive signal lives.
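A sketch of the idea, assuming a gradient-boosted classifier over concatenated batter and pitcher features; the feature names, synthetic training data, and model family are placeholders rather than the production matchup engine.

```python
# Sketch: one classifier sees both profiles at once and outputs a probability
# for every plate-appearance outcome.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

OUTCOMES = ["field_out", "strikeout", "single", "walk",
            "double", "home_run", "hbp_sac_triple"]

rng = np.random.default_rng(1)

# Each row concatenates a batter profile and a pitcher profile, e.g.
# [bat_power, bat_contact, bat_chase, pit_velocity, pit_whiff, pit_command].
X = rng.normal(size=(4_000, 6))
y = rng.integers(0, len(OUTCOMES), size=4_000)   # stand-in outcome labels

clf = HistGradientBoostingClassifier().fit(X, y)  # learns non-linear interactions

# One specific batter against one specific pitcher -> a full probability vector.
matchup = np.array([[1.2, 0.3, -0.5, 0.8, 1.1, -0.2]])
for outcome, p in zip(OUTCOMES, clf.predict_proba(matchup)[0]):
    print(f"{outcome:>15s}: {p:.3f}")
```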

Platoon Dynamics

Handedness matchups are among the most well-documented effects in baseball. Left-handed batters tend to perform worse against left-handed pitchers, and vice versa. But the magnitude of this effect varies enormously between players: some batters show almost no platoon split, while others are dramatically worse against same-side pitching. Our profiles capture player-specific platoon effects rather than applying a uniform adjustment.

What the Matchup Engine Produces

For each batter-pitcher combination, the engine outputs a complete probability vector. These probabilities drive the simulation: each simulated plate appearance draws from this distribution to determine the outcome. The probabilities are adjusted in real time for the game's environmental context before they enter the simulation.

Example: Matchup Probability Vector
Plate appearance outcome probabilities: Field Out 44% · Strikeout (K) 22% · Single (1B) 14% · Walk (BB) 8% · Double, Home Run, and HBP / Sac / Triple make up the remainder.
Illustrative distribution. Every batter-pitcher pair gets a unique probability vector. These probabilities shift based on platoon advantage, recent form, park, weather, and umpire.

05

Environment & Park Factors

The same fly ball that clears the wall at Yankee Stadium's short right-field porch dies on the warning track at Comerica Park. The same pop fly that drifts harmlessly in still air becomes an adventure on a windy afternoon at Wrigley. Environment matters, and it matters differently for every hitter.

Weather Modeling

Obsidic pulls real-time weather data for every game: temperature, humidity, barometric pressure, wind speed, and wind direction. These variables are fed into a physics model that calculates their combined effect on batted ball carry. Cold, dense air suppresses fly balls; warm, thin air lets them travel. Wind can amplify or counteract these effects depending on direction relative to the park orientation.

The carry factor isn't a simple multiplier applied to the final score. It enters the simulation at the batted-ball level, affecting the probability that a fly ball becomes a home run, an out, or something in between. This means the weather adjustment is naturally weighted toward players who hit fly balls; a ground-ball pitcher in a hitter-friendly wind environment won't see much of an impact, nor should he.
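A rough sketch of the physics step: using the ideal gas law with a simple vapor-pressure correction to turn game-time weather into a carry multiplier. The constants and baseline are textbook approximations chosen for illustration, not Obsidic's calibrated model, and wind is omitted here.

```python
# Turn temperature, pressure, and humidity into an approximate carry factor.

def air_density(temp_f: float, pressure_inhg: float, rel_humidity: float) -> float:
    """Approximate air density (kg/m^3) from game-time weather."""
    temp_c = (temp_f - 32) * 5 / 9
    temp_k = temp_c + 273.15
    pressure_pa = pressure_inhg * 3386.39
    # Saturation vapor pressure (Tetens), scaled by relative humidity.
    p_vapor = rel_humidity * 610.78 * 10 ** (7.5 * temp_c / (temp_c + 237.3))
    p_dry = pressure_pa - p_vapor
    return p_dry / (287.05 * temp_k) + p_vapor / (461.5 * temp_k)

# Baseline: a mild 70°F day at sea-level pressure with 50% humidity.
BASELINE_DENSITY = air_density(70, 29.92, 0.50)

def carry_factor(temp_f: float, pressure_inhg: float, rel_humidity: float) -> float:
    """>1.0 means the ball carries better than on the baseline day, <1.0 worse."""
    return BASELINE_DENSITY / air_density(temp_f, pressure_inhg, rel_humidity)

print(round(carry_factor(95, 29.92, 0.60), 3))  # hot, humid afternoon: ball carries (~1.06)
print(round(carry_factor(45, 30.20, 0.30), 3))  # cold April night: ball dies (~0.94)
```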

Park Factors

Each of MLB's 30 ballparks has its own character. Some favor hitters, some favor pitchers, and the effect can differ by outcome type. A park might suppress home runs but boost doubles, or vice versa. Our park model is trained on years of batted-ball outcomes and learns these per-park, per-outcome tendencies while controlling for the quality of the teams that happened to play there.

Critically, park and weather effects are combined rather than layered independently. A retractable-roof stadium on a closed-roof day is modeled differently than on an open-roof day. An indoor park like Tropicana Field is almost entirely immune to weather, while Wrigley Field's projections can shift significantly based on the day's wind pattern. The model handles these interactions rather than applying separate corrections.

Umpire Adjustments

Home plate umpires have measurable and persistent tendencies in their strike zone interpretation. Some consistently call a wider zone, which suppresses walks and slightly depresses run scoring. Others run tight zones that elevate pitch counts and put more runners on base. Obsidic maintains umpire profiles and applies a small but statistically meaningful adjustment when the day's assignments are known.

In Practice Environmental adjustments can shift projections by 0.5 to 2 runs per game in extreme cases (Coors Field in July, Wrigley with wind blowing out). On most days, in most parks, the effect is more subtle, but "subtle," compounded across thousands of projections, is the difference between a calibrated model and a biased one.
Park Factor Impact Range — Select Venues
Run scoring impact vs MLB average: Coors Field +12% · Great American +6% · Yankee Stadium +4% · Wrigley Field wind-dependent · Tropicana Field −3% · Oracle Park −7% · Comerica Park −9%
Illustrative park effects. Wrigley's range reflects wind variability. Weather adjustments are applied game-by-game on top of base park factors.

06

Game Simulation

This is where everything converges. Each game is simulated thousands of times using Monte Carlo methods. Within each simulation, a complete nine-inning game is played out plate appearance by plate appearance, with the game state evolving naturally after each event.

The simulation begins by setting the lineups and initiating the first at-bat. The matchup engine provides the probability distribution for that specific batter-pitcher combination, adjusted for environment. A random draw determines the outcome. If the batter singles, a runner is placed on first. The next batter steps up, and the process repeats but now with the game state updated. A runner on first changes the run-expectancy context. The count of outs determines when innings change. The score determines whether extra innings are needed.
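A stripped-down sketch of that loop follows, using a single shared outcome vector, naive baserunner advancement, and no extra innings; in the real engine every plate appearance draws from a fresh, matchup-specific, environment-adjusted vector.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simplified PA outcome probabilities. One vector per batter-pitcher matchup would
# come from the matchup engine; this toy uses a single shared vector.
OUTCOMES = ["out", "strikeout", "walk", "single", "double", "triple", "home_run"]
PROBS    = [0.44,  0.22,        0.08,   0.14,     0.06,     0.01,     0.05]

def simulate_half_inning() -> int:
    """Play one half-inning plate appearance by plate appearance; return runs scored."""
    outs, runs, bases = 0, 0, [0, 0, 0]          # bases[0] = first, bases[2] = third
    while outs < 3:
        outcome = OUTCOMES[rng.choice(len(OUTCOMES), p=PROBS)]
        if outcome in ("out", "strikeout"):
            outs += 1
        elif outcome == "walk":
            if all(bases):                        # bases loaded: run forced in
                runs += 1
            elif bases[0] and bases[1]:
                bases[2] = 1
            elif bases[0]:
                bases[1] = 1
            bases[0] = 1
        else:
            advance = {"single": 1, "double": 2, "triple": 3, "home_run": 4}[outcome]
            # Naive advancement: every runner (and the batter) moves `advance` bases.
            for base in (2, 1, 0):
                if bases[base]:
                    bases[base] = 0
                    if base + advance >= 3:
                        runs += 1
                    else:
                        bases[base + advance] = 1
            if advance >= 4:
                runs += 1
            else:
                bases[advance - 1] = 1
    return runs

def simulate_game() -> tuple[int, int]:
    """Nine innings per team; extra-innings logic omitted in this toy version."""
    away = sum(simulate_half_inning() for _ in range(9))
    home = sum(simulate_half_inning() for _ in range(9))
    return away, home

totals = np.array([sum(simulate_game()) for _ in range(5_000)])
print("mean total runs:", round(float(totals.mean()), 2))
print("P(over 8.5):   ", round(float((totals > 8.5).mean()), 3))
```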

Why Simulation Over Formulas

A formula-based approach can predict a team's win probability or expected run total, but it can't naturally capture the cascading dependencies within a game. Simulation can. When a leadoff batter reaches base, the simulation doesn't just add a fraction of a run to the projection; it plays out the subsequent at-bats with a runner in scoring position, where each outcome has different consequences than it would with the bases empty.

This bottom-up approach also produces every derivative metric for free. Win probability, run distributions, over/under probabilities at any line, first-inning scoring rates, and individual player stat projections all emerge from the same set of simulations. There's no risk of these metrics contradicting each other, because they all come from the same underlying simulated games.

Baserunner Advancement

When a hit or out occurs with runners on base, the simulation needs to determine how runners advance. A single with a runner on second doesn't always score the runner: it depends on the runner's speed, the outfielder's arm, and the game situation. Obsidic models these advancement probabilities using empirically derived rates that account for the type of hit, the base-out state, and runner characteristics. These rates are calibrated against actual advancement outcomes rather than assumed.
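As a sketch of what empirically derived advancement rates look like in practice, here is one illustrative lookup: scoring a runner from second on a single, keyed by the number of outs. The rates and the speed adjustment are placeholders, not the calibrated values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative rates: probability a runner on second scores on a single, by outs.
P_SCORE_FROM_SECOND_ON_SINGLE = {0: 0.58, 1: 0.62, 2: 0.72}

def runner_scores_from_second(outs: int, speed_factor: float = 1.0) -> bool:
    """Draw whether a runner on second scores on a single.

    `speed_factor` nudges the base rate up or down for fast or slow runners;
    the simple multiplicative scaling is a stand-in for an empirically fit adjustment.
    """
    p = min(P_SCORE_FROM_SECOND_ON_SINGLE[outs] * speed_factor, 0.99)
    return rng.random() < p

# With two outs runners go on contact, so the same single scores the runner more often.
for outs in (0, 2):
    rate = np.mean([runner_scores_from_second(outs) for _ in range(10_000)])
    print(f"{outs} outs: runner scores on ~{rate:.0%} of singles")
```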

Simulations per game: 5,000
Plate appearances per sim: ~70
Events simulated per game: ~140K
Runtime per game: <8s
Example: Simulated Run Distribution (5,000 games)
[Histogram of total runs across 5,000 simulated games, from 0 to 14+, with a mean of μ = 8.5.]
The full distribution — not just the mean — is what allows O/U probabilities at any line.

07

Calibration & Backtesting

A model that hasn't been rigorously tested against historical outcomes is just a collection of assumptions. Obsidic is continuously backtested against completed seasons to measure accuracy, identify systematic biases, and calibrate outputs.

Calibration is the process of ensuring that when the model says a team has a 70% chance of winning, that team actually wins about 70% of the time. This sounds simple, but raw simulation outputs are almost never perfectly calibrated. Models tend to be overconfident or underconfident at certain ranges. We apply a post-simulation calibration layer, tuned through optimization against historical results, that compresses or expands probabilities to match observed frequencies.
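One standard recipe for such a layer is Platt-style recalibration: fit a logistic mapping from the raw simulation probabilities to historical outcomes, then apply that mapping to new projections. The sketch below uses synthetic data and is an assumption about the general technique, not the exact layer Obsidic runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

# Toy history: raw simulation win probabilities and actual outcomes, where the
# raw model is deliberately overconfident (a common failure mode).
true_p = rng.uniform(0.35, 0.75, 20_000)
raw_p = np.clip(0.5 + 1.4 * (true_p - 0.5), 0.01, 0.99)   # stretched = overconfident
won = rng.random(20_000) < true_p

# Platt-style recalibration: logistic fit from raw log-odds to observed outcomes.
log_odds = np.log(raw_p / (1 - raw_p)).reshape(-1, 1)
calibrator = LogisticRegression().fit(log_odds, won)

def calibrate(p_raw: np.ndarray) -> np.ndarray:
    lo = np.log(p_raw / (1 - p_raw)).reshape(-1, 1)
    return calibrator.predict_proba(lo)[:, 1]

# On this synthetic data, an overconfident raw 0.70 is pulled back toward ~0.64.
print("raw 0.70 becomes:", calibrate(np.array([0.70])).round(3))
```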

What We Measure

Winner prediction accuracy alone is insufficient for evaluating a projection system. A model that blindly picks every home team at 51% would post a respectable accuracy number while carrying no real predictive power and no practical utility. Obsidic tracks multiple metrics across backtests:

Brier score measures the accuracy of probabilistic predictions; it penalizes confident wrong predictions more than tentative ones (a computational sketch follows this list).
Calibration curves compare predicted probabilities to observed outcomes across the full confidence spectrum.
Total runs bias catches systematic over- or under-projection of scoring.
Monthly breakdowns reveal whether accuracy is consistent across the season or degrades at certain points.
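For reference, the first two metrics are straightforward to compute; this minimal sketch uses a tiny synthetic example rather than our actual backtest data.

```python
import numpy as np

def brier_score(p: np.ndarray, outcome: np.ndarray) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return float(np.mean((p - outcome) ** 2))

def calibration_table(p, outcome, bins=(0.5, 0.55, 0.6, 0.65, 0.7, 1.0)):
    """Compare predicted win probability to observed win rate by confidence bucket."""
    rows = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (p >= lo) & (p < hi)
        if mask.any():
            rows.append((f"{lo:.0%}–{hi:.0%}", int(mask.sum()), float(outcome[mask].mean())))
    return rows

# Tiny worked example with five predictions.
p = np.array([0.72, 0.55, 0.61, 0.68, 0.53])
won = np.array([1, 0, 1, 1, 1])
print("Brier:", round(brier_score(p, won), 3))
for bucket, n, rate in calibration_table(p, won):
    print(f"{bucket}: n={n}, actual win rate={rate:.0%}")
```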

2025 Season Results

Every number below comes from running this model through the full 2025 regular season: 2,416 games, every team, every park, April through September. These are not cherry-picked windows or in-sample fits; the model ran forward each day using only data available before that day's games.

Winner accuracy (2,416 games): 64.5%
High-confidence picks (≥60%): 71.7%
Brier score (lower is better): 0.22
Run total bias (near-zero): −0.19

The model maintained consistency across the full season, with June posting the strongest month at 68.0% winner accuracy. When the model expresses high confidence (≥60% win probability), it converts at 71.7% across 1,267 such picks, a meaningful edge over both the baseline and the betting market.

Over/under projections show similar strength. On games where the projected total diverged significantly from the posted line (±1.5 runs or more), the model's directional call was correct 65.3% of the time across 966 games. Overall O/U accuracy at common lines ranges from 58.5% to 61.0%.

Accuracy Charts — 2025 Season

Monthly Winner Accuracy
Apr 63.4% (404 games) · May 61.8% (424) · Jun 68.0% (403) · Jul 66.1% (383) · Aug 62.8% (425) · Sep 65.3% (377)
Games per month shown in parentheses; June was the strongest month. The coin-flip baseline is 50%.
Projected vs Actual Run Totals by Month
            Apr    May    Jun    Jul    Aug    Sep
Projected   8.80   8.63   8.79   8.78   8.71   8.67
Actual      8.66   8.71   8.95   9.01   9.37   8.80
Average runs per game. August under-projection (−0.66) was the largest monthly bias.
Calibration Curve — Confidence vs Actual Win Rate
Model confidence    50–55%   55–60%   60–65%   65–70%   70%+
Actual win rate      54.2%    59.2%    64.9%    68.6%    80.2%
Sample size (n)        598      551      441      347      479
A well-calibrated model tracks the diagonal. Obsidic's 70%+ tier outperforms its own confidence; the model is slightly conservative at the top end.
Over/Under Accuracy by Month (8.5 Line)
Apr 56.7% · May 58.0% · Jun 58.6% · Jul 59.5% · Aug 57.9% · Sep 60.7%
Consistently above 56% every month; September was the strongest at 60.7%.
Player Prop Accuracy — Hit Picks by Confidence Tier
Confidence threshold    ≥60%     ≥65%     ≥70%     ≥75%     ≥80%
Actual hit rate         79.9%    84.5%    88.2%    90.8%    92.7%
Sample size            17,195   13,371   10,092    7,520    5,363
Actual hit rate vs model confidence threshold, with sample sizes in the bottom row. Based on 40,799 player-game validations.
Transparency Note These results compare the model's pregame projections against genuine 2025 outcomes. We publish these numbers because we believe transparency builds trust. If our accuracy ever degrades, we'll report that too.

08

Player Prop Projections

Because the simulation plays out every plate appearance for every player, individual stat projections fall out naturally. We don't run a separate model to predict whether a batter will hit a home run; we count how often he hits one across thousands of simulated games.

Batter Props

For each batter in the starting lineup, the simulation produces distributions for hits, home runs, total bases, strikeouts, walks, and RBIs. These aren't point estimates; they're full probability distributions. We can report not just "projected 1.2 hits" but the probability of 0, 1, 2, or 3+ hits, which maps directly to how prop bets are structured.
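Because the prop numbers are just counts over the simulated games, deriving them is a one-liner. A sketch, using a synthetic array of per-simulation hit counts in place of real simulation output:

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for one batter's hit count in each of 5,000 simulated games.
sim_hits = rng.binomial(n=4, p=0.27, size=5_000)   # ~4 at-bats at a .270 clip

print(f"projected hits: {sim_hits.mean():.2f}")
for k in range(3):
    print(f"P({k} hits) = {(sim_hits == k).mean():.2f}")
print(f"P(3+ hits) = {(sim_hits >= 3).mean():.2f}")
print(f"P(1+ hits) = {(sim_hits >= 1).mean():.2f}")   # maps directly to a '1+ hit' prop
```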

Expected Stats Correction

Raw simulation outputs are further refined through a correction layer that compares each player's actual surface-level results against their underlying quality metrics: how hard they're hitting the ball and at what angles, independent of whether those batted balls happened to land for hits. When a player's results are running significantly ahead of or behind their batted-ball quality, the correction nudges projections back toward what the underlying contact data supports. This catches both lucky streaks and unlucky slumps faster than waiting for the stats to regress naturally.

The strength of this correction is calibrated empirically: too aggressive, and it overcorrects; too gentle, and it fails to catch real regression. The current balance was determined through iterative backtesting across multiple seasons of player-game outcomes.
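A minimal sketch of the correction's shape, with an assumed correction strength of 0.6 standing in for the empirically tuned value:

```python
def corrected_rate(actual_rate: float, expected_rate: float,
                   correction_strength: float = 0.6) -> float:
    """Nudge a projection from a player's actual results toward his
    contact-quality-based expected rate.

    The 0.6 strength is a placeholder; the production value was tuned
    through multi-season backtesting.
    """
    return actual_rate + correction_strength * (expected_rate - actual_rate)

# A .220 hitter whose contact quality supports .260 gets projected up;
# a .280 hitter outrunning a .250 expected rate gets pulled back.
print(round(corrected_rate(0.220, 0.260), 3))   # 0.244
print(round(corrected_rate(0.280, 0.250), 3))   # 0.262
```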

2025 Full-Season Validation

Player props were validated against 40,799 individual batter-game outcomes across the 2025 season. We tracked every projected plate appearance and scored it against what actually happened.

Hit rate on ≥80% confidence picks: 92.7%
K rate on ≥80% confidence picks: 87.6%
HR rate on ≥25% confidence picks: 67.1%
Player-games validated: 40,799

When the model says a batter has an 80% or better chance of recording a hit, that batter gets a hit 92.7% of the time, across 5,363 such predictions. Strikeout projections show similar reliability, and even home run calls, the hardest prop to predict in baseball, hit at 67.1% when the model signals 25% or higher probability.

Pitcher Props

Starting pitchers receive projections for innings pitched, strikeouts, earned runs, and quality start probability. The simulation tracks these naturally as the game unfolds: a pitcher's earned runs accumulate from the runs scored against him during his simulated outing, and his strikeout total reflects the actual matchup-by-matchup outcomes across the lineup.

First-Inning & First-Five Projections

The first inning is a distinct environment. Starting pitchers behave differently in the first: some are notoriously slow starters, while others are at their sharpest before fatigue sets in. Batters face the starter at his freshest, but they also get their first look at his arsenal that day.

Obsidic maintains separate first-inning performance profiles for pitchers, built from historical first-inning-specific data. These profiles drive the simulation's first-inning outcomes, producing YRFI (Yes Run First Inning) and NRFI probabilities that reflect the actual tendencies of the pitchers involved rather than applying a blanket league-average rate.

The simulation also produces first-five-inning (F5) totals: the projected run total through the fifth inning, before bullpens enter. F5 projections hit 59.4% accuracy on over/under calls at the 4.5 line across the full 2025 season, with a near-zero bias of +0.18 runs. Because F5 projections isolate starting pitching matchups without bullpen noise, they can offer cleaner edges than full-game totals.
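A sketch of how YRFI/NRFI and F5 numbers fall out of stored per-inning simulation results; the per-inning run array here is synthetic, standing in for the output of the simulation described in section 06.

```python
import numpy as np

rng = np.random.default_rng(9)
n_sims = 5_000

# Synthetic per-inning run totals for both teams in each simulated game;
# the real values come from the first-inning-specific profiles described above.
inning_runs = rng.poisson(lam=0.48, size=(n_sims, 2, 9))   # [sim, team, inning]

first_inning = inning_runs[:, :, 0].sum(axis=1)
yrfi = (first_inning > 0).mean()            # any run scored in the first inning

f5_total = inning_runs[:, :, :5].sum(axis=(1, 2))
f5_over_45 = (f5_total > 4.5).mean()

print(f"YRFI probability: {yrfi:.2f}   (NRFI = {1 - yrfi:.2f})")
print(f"Projected F5 total: {f5_total.mean():.2f}   P(over 4.5) = {f5_over_45:.2f}")
```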


09

Edge Detection

Obsidic is a research and analytics platform, not a picks service. Our projections are built for education and independent analysis. Any edges surfaced are informational, not recommendations to wager.

With that in mind, a projection is only valuable if it can be compared against something. Obsidic integrates with major sportsbook odds in real time, converting posted lines into implied probabilities and comparing them against our model's calibrated output.

When our projection diverges meaningfully from the market, we flag it as a potential edge. The size of the edge is expressed as expected value (EV), the theoretical return per dollar wagered if our model's probability is correct. Not every edge is created equal: a small edge on a high-confidence projection is more reliable than a large edge on a coin-flip game, and our output reflects this distinction through a tiered rating system that accounts for both edge size and projection confidence.
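For reference, the conversion from a posted moneyline to an implied probability, and the EV calculation, look like this. The odds and model probability below mirror the illustrative NYY ML row in the example table further down, and no vig removal is shown.

```python
def implied_probability(american_odds: int) -> float:
    """Convert an American moneyline into the book's implied win probability
    (vig included, since no de-vigging is shown in this sketch)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def expected_value(model_prob: float, american_odds: int) -> float:
    """Theoretical return per $1 staked if the model probability is correct."""
    profit = (100 / -american_odds) if american_odds < 0 else (american_odds / 100)
    return model_prob * profit - (1 - model_prob)

# Model says 64% on a team priced at -122 (~55% implied).
odds = -122
print(f"implied prob: {implied_probability(odds):.3f}")
print(f"edge: {0.64 - implied_probability(odds):+.3f}")
print(f"EV per $1: {expected_value(0.64, odds):+.3f}")
```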

Edge detection runs across all market types: moneylines, spreads, game totals at multiple lines, first-inning scoring, first-five-inning totals, and individual player props including hits, home runs, total bases, and strikeouts. Each flagged play includes the model's probability, the implied market probability, the edge percentage, and the expected value, giving you the full picture for research purposes.

Important Caveat An edge in our model does not guarantee a profitable bet. Sportsbooks employ sophisticated modeling of their own, and the true probability of any game falls somewhere in the uncertainty between all available models. What we provide is a well-calibrated second opinion, one built on transparent, tested methodology.
Example: How Edge Detection Works
PLAY            OBSIDIC   SPORTSBOOK   EDGE   RATING
NYY ML            64%        55%       +9%    ★★★
Over 8.5          60%        52%       +6%    ★★
Marte 1+ Hit      75%        71%       +4%    ★
Illustrative examples. ★★★ = highest confidence edge, ★ = marginal edge.

10

Honest Limitations [Work in Progress]

No model captures everything. Being transparent about what we don't model is as important as explaining what we do.

What we're still improving

Bullpen granularity. The simulation currently models the starting pitcher in detail but uses aggregate bullpen quality for relief innings. Since bullpens throw roughly 35–40% of total innings, this is an area where more granular pitcher-by-pitcher modeling would improve projections, particularly for totals and late-game win probability.

In-game managerial decisions. Pinch-hitting, intentional walks, defensive shifts, and bullpen sequencing are all influenced by game state in ways that are difficult to model pre-game. The simulation uses probabilistic rules for common decisions but doesn't attempt to simulate specific managerial tendencies.

Defense. Team defensive quality affects BABIP (batting average on balls in play) and is partially captured through pitcher profiles, but we don't yet model individual fielder positioning or defensive runs saved at the player level.

Early-season and small-sample players. The model handles rookies, call-ups, and players returning from injury by regressing toward league-average baselines. This is conservative by design: it prevents wild projections from thin data, but it also means breakout players aren't captured quickly. As sample sizes grow through the season, individual signals strengthen and this limitation fades.

The irreducible floor

Baseball is the most random of the major professional sports. The best team in baseball loses 40% of its games. A perfectly calibrated model would still be "wrong" on roughly one in three games, and no amount of modeling sophistication will change that. A good model identifies where the probabilities are off; it doesn't eliminate the uncertainty, it measures it honestly.

Our Commitment Obsidic is under continuous development. The pipeline, models, and calibration are updated throughout the season as more data becomes available and as we identify areas for improvement. Backtesting results are published transparently, including the numbers that don't look impressive, so you can evaluate our track record on your own terms.