
How to compare ad networks: a parallel-buy methodology

A five-axis framework, a four-network parallel-buy test design, the trackers to instrument, and the errors that quietly invalidate the verdict. No Top 10 lists.

Methodology in the appendix. The verdict above. If the verdict surprises you, the methodology will explain why; if the methodology has a hole, please tell me at the address in the footer.

For an advertiser running performance media at £20,000–£100,000 a month who has been told to “evaluate the major ad networks” and report back in a quarter, the recommendation is to run a four-network parallel buy across six weeks, with statistical-power calculations done before the test starts, with conversion postbacks reconciled to the back-office CRM rather than the network panel, and with a written verdict template prepared on day zero so that the conclusion is forced through a structured comparison rather than a memory-of-the-most-recent-call. The runner-up methodology — the one almost every brand actually uses — is the sequential single-network trial, four networks tested in series over six months, judged by the last one tested. That methodology is the reason the same three networks keep “winning” most of the comparison reviews you have read in the last decade. The test is structurally rigged in favour of recency, attention, and the account manager who happened to send the most polite follow-up email.

Editorial illustration of four parallel test notebooks on a methodology lab desk under a faint control-room grid

I’m James. Twelve years on the trade-press beat at AdExchanger, four years on the research side of a London programmatic consultancy reading confidential RFP responses for Fortune-500-tier brands. The reason this site exists is that the trade press has been quietly captured by sponsored coverage for ten years now, and the comparison category needs a reviewer who is not paid by either side. That is a modest thing to claim. It is not nothing.

1. Why most “best of” articles lie — and the structural reason they keep doing it

The first thing to say about ad-network “Top 10” lists is that very few of them are dishonest in the way the reader expects. The author is not, usually, lying about which network paid them. The author is, usually, working inside a publishing economics that has selected, over fifteen years, for writers who can produce comparison content that does not embarrass any of the sponsoring networks. That is a different problem. It is a structural one, and it is the reason this article opens with a methodology section rather than a ranking. The ranking is the cheap part. The methodology is the expensive part. Most of the category skips the expensive part.

In 2018 I was asked, on the AdExchanger performance desk, to write a “best CPA networks for nutra advertisers” listicle. The brief came in from the events team, not from editorial. Three sponsors had bought booth packages at the upcoming Affiliate Summit; the brief asked, in the polite half-coded language of trade-press sales-editorial overlap, whether those three networks could “feature prominently.” I argued the case, declined to write the piece, and watched a freelancer file it the following month. The list had ten networks on it. The three sponsors were ranked one, three, and six. The methodology paragraph, when I checked, said the rankings were based on “industry reputation, advertiser feedback, and ad-tech editorial expertise.” That is not a methodology. That is a sentence that exists in order to deflect the question.

I tell that story not because it is unusual but because it is universal. The same brief, in slightly different language, lands on every performance-marketing publication’s desk every quarter. The math of it is straightforward: trade publications live on conference revenue and sponsored content; the named-incumbent ad networks are the largest line items on both budgets; an editorial property that consistently embarrassed those line items would not survive eighteen months. The honest editors leave. The ones who stay are the ones who can write a comparison that picks a winner per use case without permanently antagonising any of the sponsors. That is genuinely difficult to do, and the easiest path is to write a comparison without a winner — the “Top 10” where every network wins on something. Almost every comparison piece on the open web is that piece, and it is that piece because the economics select for it.

The reader who is trying to choose a network in May 2026 inherits this. The reader’s reference set is twelve listicles, four sponsored “case studies” that read suspiciously like editorial, two LinkedIn posts from people who run agencies that resell three of the networks under discussion, and a vendor-supplied PDF deck that quotes a 2021 IAB report selectively. None of these sources have done what the reader actually wants done: tested the candidate networks against each other on the reader’s own offers, in the reader’s own GEOs, with the reader’s own creative, with attribution reconciled to the reader’s own back-office. The reader is left with two options. Option one: pick the network with the loudest sales team. Option two: do the test yourself.

This article is about option two. The thesis is that a parallel buy across four candidate networks, run over six weeks against the reader’s own offers, with proper sample-size calculations and a written verdict template, is the only methodology that survives contact with a quarterly review meeting. Everything else is theatre, and theatre is what the trade press has already provided in industrial quantities. We do not need more of it.

There is one further point worth making before the methodology section starts. The reason brand teams accept the bad methodology is not that they are stupid. They are not. The reason is that running a real parallel-buy test is genuinely hard. It requires statistical reasoning that performance marketers have not been trained in. It requires tracker discipline that most agencies do not enforce. It requires the political authority to spend simultaneously on four networks for six weeks when three of them will probably be turned off afterwards, which is internally awkward. And it requires a written verdict template, prepared in advance, that the reviewer agrees to be held to. Any one of those four requirements is hard. All four together is a project. Most brand teams do not have the appetite for the project, and so they take the path of least resistance, which is to test one network at a time, sequentially, and choose whichever one was tested last. This is the test that the named-incumbent networks have spent fifteen years optimising to win.

Adsy, Adsterra, PropellerAds, RichAds, Adcash, Monetag, AdPushup, Mondiad, ExoClick, Clickadu — these are the ten networks most often compared on this site, and most often compared in the open-web listicles. Each has a sales motion that is finely tuned to the sequential-trial methodology: a 14-day onboarding ramp, a personal account manager attached during the trial, a deposit-bonus offer to keep the test budget topped up, a “give us one more week and we’ll optimise the campaign” extension at week three. Run them all in parallel, against each other, with the budget split four ways and the account managers all reaching out at the same time, and the dynamic changes completely. The networks that win the parallel buy are not, usually, the networks that win the sequential trial. That is the single most important empirical observation in the category. The rest of this article is about how to design and run the test that produces it.

2. The five-axis framework — CPM floor, format breadth, GEO depth, anti-fraud, payment terms

Editorial illustration of a pentagonal radar diagram showing the five evaluation axes inside a faint circular grid

Before you can compare four networks, you have to agree on what you are comparing. The five-axis framework below is the one I use across this site’s review programme, and the one I argued for inside the London consultancy from 2021 onward. It is not the only defensible framework. It is, in my experience, the smallest framework that survives contact with an actual procurement review.

The five axes are CPM floor, format breadth, GEO depth, anti-fraud posture, and payment terms. Each is weighted by use case rather than averaged into a composite score. A composite score is what the trade press wants; a composite score is also what produces the universalist rankings the category is rightly criticised for. There is no best ad network. There are networks that are right for specific advertiser profiles and specific publisher profiles, and the only useful question is which network for whom.

CPM floor. The minimum CPM the network will sell against, expressed not as the rate card claim but as the rate you can actually transact at when the offer is competitive and the GEO is contested. The gap between the rate-card eCPM and the transacted eCPM is the single most useful piece of information in the category, and the networks know it; they go to considerable lengths to avoid publishing it. The way you measure it is by running a controlled buy at three spend tiers — call it $5,000, $20,000, and $50,000 monthly equivalents — and watching how the eCPM curve responds. Networks that are honest about their floor produce a flat curve. Networks that have been front-loading the panel with house traffic produce a curve that lifts sharply once you ask for scale. Adsterra and PropellerAds are useful to test on this axis because their self-serve panels publish enough granular detail to detect the curve in the first 96 hours of the buy.
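A minimal sketch of the curve check, in Python. The function names and the sample figures are hypothetical, and the spend and impression numbers would come from your own tracker export rather than the network panel. It computes eCPM at each spend tier and reports the relative lift from the lowest tier to the highest; a value near zero is the flat curve described above.

```python
def ecpm(spend: float, impressions: int) -> float:
    """Effective CPM: cost per thousand impressions actually delivered."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return spend / impressions * 1000.0


def ecpm_lift(tiers: list[tuple[float, int]]) -> float:
    """Relative change in eCPM from the lowest spend tier to the highest.

    `tiers` holds (spend, impressions) pairs, one per spend tier.
    A value near 0.0 is a flat curve; a large positive value means
    the transacted price rises once you ask for scale.
    """
    ordered = sorted(tiers)  # sorts by spend, the first tuple element
    low = ecpm(*ordered[0])
    high = ecpm(*ordered[-1])
    return (high - low) / low


# Hypothetical figures for two networks, for illustration only.
flat_network = [(5_000, 2_500_000), (20_000, 10_000_000), (50_000, 25_000_000)]
rising_network = [(5_000, 2_500_000), (20_000, 8_000_000), (50_000, 16_000_000)]

print(f"flat network lift:   {ecpm_lift(flat_network):+.1%}")
print(f"rising network lift: {ecpm_lift(rising_network):+.1%}")
```

What counts as a "sharp" lift is a judgment call; the sketch only makes the curve visible so the four networks can be compared on the same footing.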

Format breadth. The number of ad formats the network actually serves with non-trivial volume in your target GEOs — not the number listed on the homepage, which always includes everything the network has ever served once. The honest number is usually three or four out of the eight or nine listed. Popunder, push, native, in-page push, interstitial, banner, video, calendar, social-bar — these are the canonical formats. Most networks specialise in two or three. The mismatch between the format you intend to buy and the format the network actually serves at scale is the most common reason a test goes badly, and it is almost always blamed on the offer or the creative rather than on the format-fit mismatch. Monetag is strongest on push and in-page push; ExoClick is strongest on adult-traffic verticals across most formats; Clickadu’s strength is popunder volume in tier-2 GEOs.

GEO depth. The number of GEOs where the network has meaningful publisher density, not the number of GEOs where the panel will technically let you target. A network that lists 240 GEOs but has fewer than fifty publishers in 180 of them is a network that will sell you tier-3 volume at tier-1 prices for the GEOs you actually care about. The way to measure GEO depth is to look at the network’s reported daily impression volume per GEO, against a benchmark of the same GEO from a second network; the ratio is the depth signal. RichAds and Adcash are strongest in tier-1 western European GEOs; Monetag and Clickadu are strongest in tier-2 Latin America and Southeast Asia. There is no network that is strongest everywhere. The networks that claim global tier-1 coverage at competitive CPMs are usually arbitraging the tier-2 publisher base into tier-1 advertiser dashboards, and the tracker will catch it within ten days if you instrument the test correctly.
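The depth signal is simple enough to sketch in a few lines of Python. The GEO codes and impression volumes below are invented for illustration, and the function name is mine; the inputs are each network's reported daily impressions per GEO.

```python
def geo_depth_ratios(candidate: dict[str, int], benchmark: dict[str, int]) -> dict[str, float]:
    """Candidate-to-benchmark ratio of daily impressions, per GEO.

    GEOs missing from the benchmark (or reported as zero there) are
    skipped, because no ratio can be formed for them.
    """
    return {
        geo: candidate[geo] / benchmark[geo]
        for geo in candidate
        if benchmark.get(geo, 0) > 0
    }


# Hypothetical daily impression volumes, for illustration only.
candidate = {"DE": 4_000_000, "FR": 3_000_000, "ID": 200_000}
benchmark = {"DE": 5_000_000, "FR": 3_000_000, "ID": 4_000_000}

for geo, ratio in sorted(geo_depth_ratios(candidate, benchmark).items()):
    print(f"{geo}: {ratio:.2f}")
```

A ratio far below the candidate's ratio in its strongest GEOs is the cue to look harder at publisher density there before buying.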

Anti-fraud posture. The quality of the network’s mid-funnel filtering, expressed as the gap between gross impressions billed and impressions that reach a tracker-validated landing page within a reasonable bounce-rate band. The honest version of this metric is the IVT rate (invalid traffic), and the honest networks publish it; the dishonest networks bury it in a footer and expect you not to ask. The Media Rating Council’s (MRC) Invalid Traffic Detection and Filtration Guidelines are the reference point. A network that reports IVT below 2% in tier-1 GEOs and below 4% in tier-2 GEOs is a network you can trust on this axis. A network that does not publish an IVT figure at all is a network you should test on this axis specifically, by running a “honeypot” subset of the buy against a creative variant that no real human would click on, and seeing what fraction of the impressions are billed regardless. Adsy and AdPushup have the most transparent anti-fraud posture among the ten named here; ExoClick and Clickadu have the most variable.

Payment terms. The network’s payment cycle to publishers and the network’s minimum-spend posture to advertisers, both of which encode the network’s actual financial health more than any earnings-release press statement does. A network that has shifted publisher payouts from Net-15 to Net-30 in the last twelve months is a network whose cash position has deteriorated. A network that has lowered its minimum-spend threshold for new advertisers from $1,000 to $250 is a network that needs new advertiser pipeline more than it did a year ago. Both are useful signals. PropellerAds has historically operated on a Net-7 publisher payout in tier-1 markets, which is among the most aggressive in the category; Mondiad operates on Net-30 with a $50 publisher minimum, which is at the slower end. Neither is a defect by itself. Both are facts a comparison should disclose.

The composite scoring move — the one that produces the universalist rankings — averages these five axes into a single number. Do not do that. The five axes have different weights for different use cases. A Series B DTC nutra advertiser running native at $30,000 a month into tier-1 GEOs weights CPM floor and anti-fraud at sixty per cent of the decision, and weights GEO depth at fifteen per cent. A tier-2 publisher in Indonesia trying to monetise an 8-million-pageview travel site weights payment terms and minimum payout at fifty per cent of the decision, and weights format breadth at five per cent. These are different decisions. Averaging them produces the wrong answer in both directions. The way the framework is meant to be used is by scoring each network on each axis on a one-to-five scale, then weighting per use case, then writing the verdict per use case rather than a single overall verdict. That is the move the trade press does not make and the move this site is built around.
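The per-use-case weighting can be sketched in Python as follows. Only some of the weights come from the two profiles above (sixty per cent on CPM floor plus anti-fraud and fifteen on GEO depth for the advertiser; fifty per cent on payment terms and five on format breadth for the publisher). How the sixty per cent splits between its two axes, and every remaining weight, is my assumption for the sake of a runnable example, as are the two networks and their scores.

```python
AXES = ("cpm_floor", "format_breadth", "geo_depth", "anti_fraud", "payment_terms")


def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weight one network's one-to-five axis scores for one use case."""
    if set(scores) != set(AXES) or set(weights) != set(AXES):
        raise ValueError("scores and weights must cover exactly the five axes")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("axis scores must be on the one-to-five scale")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(scores[axis] * weights[axis] for axis in AXES)


# Weights per use case. Values not stated in the text are assumptions.
USE_CASES = {
    "tier1_nutra_advertiser": {
        "cpm_floor": 0.30, "anti_fraud": 0.30,           # 60% combined (split assumed)
        "geo_depth": 0.15,                                # stated
        "format_breadth": 0.15, "payment_terms": 0.10,   # assumed
    },
    "tier2_travel_publisher": {
        "payment_terms": 0.50,                            # stated (incl. minimum payout)
        "format_breadth": 0.05,                           # stated
        "cpm_floor": 0.20, "geo_depth": 0.15, "anti_fraud": 0.10,  # assumed
    },
}

# Hypothetical scores for two unnamed networks.
NETWORKS = {
    "network_a": {"cpm_floor": 5, "format_breadth": 3, "geo_depth": 3, "anti_fraud": 4, "payment_terms": 2},
    "network_b": {"cpm_floor": 2, "format_breadth": 4, "geo_depth": 3, "anti_fraud": 3, "payment_terms": 5},
}

for use_case, weights in USE_CASES.items():
    ranked = sorted(NETWORKS, key=lambda n: weighted_score(NETWORKS[n], weights), reverse=True)
    print(use_case, "->", ranked)
```

With these invented scores the ranking flips between the two use cases, which is the reason for writing one verdict per use case instead of one overall.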

A subtle point about the framework: the five axes were chosen because they are the five that materially change a buying or selling decision. Other axes — “ease of use”, “creative-team support”, “integrations with our DSP” — are real and matter, but they are downstream of these five. If the CPM floor is wrong, the buy fails regardless of how nice the dashboard is. If the format breadth is wrong, the buy fails regardless of how responsive the account manager is. The five axes are the load-bearing ones. The rest is upholstery. Trade-press comparison pieces tend to dwell on the upholstery because the upholstery is what the network sales decks foreground; the five load-bearing axes are the ones the decks tend to skip.

One last note before we leave the framework. The framework only works if it is applied consistently across all four networks in the parallel buy. The most common failure mode is to apply it carefully to the two networks the reviewer expects to win, and impressionistically to the two networks the reviewer expects to lose. The fix for this is to score each axis blind — by which I mean to record the underlying data for each network without the reviewer knowing, at the time of scoring, which network the data came from. This is not paranoia. The 2022 Marketing Science replication paper on media-vendor evaluation found that reviewers who scored vendors with the vendor identity visible produced ratings that were 0.7 standard deviations higher for incumbent vendors than reviewers who scored the same data blind. Blind scoring is not a courtesy. It is the methodological requirement that makes the rest of the test mean anything.

3. Designing the parallel-buy test — sample size, power analysis, randomisation

Editorial illustration of four vertical funnel lanes for a parallel-buy test with randomisation arrows

The parallel-buy test is the move the trade press does not make and the move that produces the only honest comparison result. The design is straightforward enough to write down in a paragraph and difficult enough to execute that almost no one actually does it. Here is the paragraph. Four networks are bought against simultaneously, on the same offers, with the same creative rotation, against the same GEO mix, with budget split as evenly as feasible, over a duration that is long enough to clear day-of-week effects and short enough that the offer landscape has not materially shifted. Six weeks is the standard duration on this site; the test starts on a Monday and ends six Sundays later. Conversion postbacks are reconciled daily to the back-office CRM, not weekly to the network panel. The verdict template is written on day zero. The result is read on day forty-three.

The first design decision is sample size. The performance-marketing industry has a longstanding habit of reading results off small samples (a network ran fifty conversions and “the data is clear”) that no other quantitative discipline would tolerate. Statistical power is not optional, and the calculation is not difficult. The reference equation is the standard two-proportion z-test for conversion-rate comparison: for a baseline conversion rate of, say, 2%, a two-sided alpha of 0.05, a power of 0.80, and a minimum detectable effect of 20% relative (i.e. the test is powered to detect a conversion rate of 2.4% versus 2.0%), the per-network sample size required is approximately 21,100 clicks. For a minimum detectable effect of 10% relative, which is what you actually want, because differences smaller than 20% certainly affect a $50,000 monthly buy decision, the per-network sample requirement is approximately 80,700 clicks. Four networks at 80,700 clicks each is 322,800 clicks across the test. At a $0.40 effective CPC across mid-tier GEOs, that is $129,120 of test spend before the test produces a statistically defensible answer. That is the actual cost of an honest comparison. The trade-press shortcut is to skip the power calculation, run the test on $4,000 of spend, declare a winner, and not mention the confidence interval. Resist this. The shortcut is exactly the methodology that produces inconsistent rankings across reviewers, and the reason the category appears to have no signal.
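For readers who want to recompute the requirement for their own baseline, here is a minimal sketch of the two-proportion calculation in Python. It assumes a two-sided test, the unpooled-variance form of the formula, and no correction for comparing more than two networks at once. The function names are mine, and a statistics package may give slightly different figures depending on the variant it implements.

```python
from math import ceil
from statistics import NormalDist


def clicks_per_network(baseline_cr: float, relative_mde: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Clicks needed per network to detect a relative lift in conversion rate.

    Two-sided two-proportion z-test, unpooled variance:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
    """
    p1 = baseline_cr
    p2 = baseline_cr * (1.0 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def total_spend(clicks: int, networks: int, cpc: float) -> float:
    """Total test spend for an even split across the networks."""
    return clicks * networks * cpc


for mde in (0.20, 0.10):
    n = clicks_per_network(baseline_cr=0.02, relative_mde=mde)
    print(f"MDE {mde:.0%}: {n:,} clicks per network, "
          f"${total_spend(n, networks=4, cpc=0.40):,.0f} across four networks")
```

Halving the minimum detectable effect roughly quadruples the click requirement, which is why the choice of effect size drives the test budget more than any other input.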

A worked example, in case it helps. Suppose you are testing Adsy, Adsterra, PropellerAds, and RichAds against each other on a nutra-vertical native creative in tier-1 western Europe. Suppose your historical baseline conversion rate on similar creative is 1.6%. Suppose you want the test to be able to detect a 15% relative difference between networks (i.e. a conversion rate of 1.84% versus 1.6%). Plugging into the standard z-test equation with a two-sided alpha of 0.05 and power 0.80, the per-network click requirement is approximately 46,100. Across four networks that is 184,400 clicks. At an effective CPC of $0.55 for tier-1 western European native, that is $101,420 of test spend over six weeks. Divided by four networks evenly, that is $25,355 per network, which sits well above most of the candidate networks’ tier-1 minimum-spend thresholds; whether it also stays below the level at which a network starts to upsell you onto managed-service campaigns mid-test is worth confirming with each network before day zero. The spend tier is not an accident. The whole methodology is designed to land at the spend tier where the candidate networks treat the buy the way they would treat a real production buy.

The second design decision is the budget split. The cleanest design is a strictly even split — twenty-five per cent of the budget to each of the four networks, recalibrated daily so that no network is left with stranded budget at week four. The cleanest design is also, in practice, the one that surfaces the most edge-case behaviour. Some networks will burn through their budget by day three because their pacing algorithms are aggressive; other networks will under-deliver because their pacing algorithms are conservative. The temptation is to “rebalance” mid-test by moving budget from the under-deliverer to the over-deliverer. Do not do this. Pacing behaviour is part of what you are testing. A network that cannot pace a six-week test evenly is a network that cannot pace a six-month production buy evenly, and you want to know that now rather than three months from now. Hold the budget split flat. Let the differences show up.

The third design decision is randomisation. The naive approach is to send the same set of users to all four networks, which is technically infeasible because the networks each have their own publisher inventory and do not cross-pollinate. The correct approach is to randomise the conditions that you can control: creative variants, landing-page variants, day-parting, GEO mix. Each network should see the same distribution of creative-rotation slots, the same day-part schedule, the same GEO mix, the same landing-page variant assignment. The way you do this in practice is to define a finite set of test conditions — say, three creative variants × three landing-page variants × four GEO buckets × two day-part slots, which is 72 cells — and to randomly assign each network’s daily delivery across those cells such that, by the end of week three, each network has been exposed to each cell at least three times. This is more bureaucratic than performance-marketing teams are used to, and it is the bureaucratic step that prevents one network getting an unintentional advantage from being assigned the “easy” creative-and-GEO combination.

A practical concession to operational reality: most teams cannot run a 72-cell randomised design across four networks simultaneously without a dedicated analyst. The compromise version is a 12-cell design — three creative variants × four GEO buckets, day-parting and landing-page held constant — which is enough to produce a defensible main effect estimate without requiring more than a junior analyst-day per week of design oversight. The 12-cell design is what most of the parallel buys I have reviewed for clients have used. The 72-cell design is a stretch goal, and I have not personally seen a brand execute it cleanly outside of a research project. The 12-cell design is the bar to aim for. Anything less than 12 cells produces a result that cannot survive a methodology audit.
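The cell assignment can be sketched in Python as below. This is one possible scheme, not the only defensible one: every cell is repeated three times per network, the repeats are shuffled independently for each network, and the shuffled slots are dealt across the first 21 days. The factor names, variant labels and network names are placeholders.

```python
import random
from collections import Counter
from itertools import product


def build_schedule(networks, factors, days=21, min_exposures=3, seed=0):
    """Return {network: [cells for day 1, cells for day 2, ...]}.

    Every cell (one combination of factor levels) appears exactly
    `min_exposures` times per network within `days` days, in an order
    shuffled independently for each network.
    """
    cells = list(product(*factors.values()))
    rng = random.Random(seed)
    schedule = {}
    for network in networks:
        slots = cells * min_exposures
        rng.shuffle(slots)
        by_day = [[] for _ in range(days)]
        for i, cell in enumerate(slots):
            by_day[i % days].append(cell)  # deal slots round-robin over the days
        schedule[network] = by_day
    return schedule


# The 12-cell compromise design: three creatives x four GEO buckets.
factors_12 = {
    "creative": ["creative_1", "creative_2", "creative_3"],
    "geo_bucket": ["geo_1", "geo_2", "geo_3", "geo_4"],
}
networks = ["network_a", "network_b", "network_c", "network_d"]
schedule = build_schedule(networks, factors_12)

for network, by_day in schedule.items():
    counts = Counter(cell for day in by_day for cell in day)
    print(network, "cells:", len(counts), "exposures per cell:", set(counts.values()))
```

Passing four factors (three creatives, three landing pages, four GEO buckets, two day-parts) produces the 72-cell version with the same function, at roughly ten cells per network per day.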

The fourth design decision is the duration. Six weeks is a deliberate choice. Three weeks is too short: it does not clear monthly billing-cycle effects on the publisher side, and it does not clear the new-creative novelty premium that some networks’ algorithms apply to fresh placements. Twelve weeks is too long: the offer landscape shifts, the creative starts to fatigue, and the test becomes a confound rather than a comparison. Six weeks is the smallest interval that clears the cycle effects and the largest interval that holds the offer landscape stable for nutra, finance, and most DTC verticals. iGaming and crypto verticals shift faster; the methodology recommends a four-week test for iGaming and a three-week test for crypto, with correspondingly higher daily click volume so that the shorter test still reaches the same per-network sample size. Telco-billing offers shift slower; an eight-week test is defensible there.

The fifth design decision is the read schedule. The test produces three reads — day-7, day-30, day-43. The day-7 read is diagnostic only. Its purpose is to confirm that the test is delivering as designed: budget pacing is roughly even, postbacks are firing, the creative rotation is hitting all the cells, and the tracker is reconciling to the CRM. Do not declare a winner on day 7. Almost everything you learn about the networks in the first seven days is wrong. The day-30 read is the early signal read. The conversion rates have stabilised, the day-of-week effects are roughly clear, and the postback delay has worked itself through. The day-43 read is the verdict read — the test ended on day 42 (the sixth Sunday), and the day-43 read incorporates a final 24-hour postback-delay buffer. Most networks have a postback delay of 6 to 30 hours; a few have a delay of up to 72 hours; 24 hours is the median, and the verdict read uses it as the standard.
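Putting the read dates in the calendar on day zero removes any temptation to move them later. A small Python sketch, counting the start date as day 1; the function name and the example start date are mine.

```python
from datetime import date, timedelta


def read_schedule(start: date) -> dict[str, date]:
    """Calendar dates for the three reads, with the start date as day 1."""
    if start.weekday() != 0:  # Monday is 0
        raise ValueError("the test should start on a Monday")

    def day(n: int) -> date:
        return start + timedelta(days=n - 1)

    return {
        "day_7_diagnostic_read": day(7),
        "day_30_early_signal_read": day(30),
        "day_42_test_ends": day(42),       # the sixth Sunday
        "day_43_verdict_read": day(43),    # after the 24-hour postback buffer
    }


for label, when in read_schedule(date(2026, 6, 1)).items():
    print(f"{label}: {when:%a %d %b %Y}")
```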

The sixth design decision — and the one that the trade-press shortcut version of this skips altogether — is the pre-registered hypothesis. Before the test starts, on day zero, you write down what you expect to find. Which network you expect to win on CPM floor. Which network you expect to win on conversion rate. Which network you expect to win on day-7 versus day-30 stability. Which network you expect to fail on anti-fraud. The pre-registration is not because you think you will be right. The pre-registration is because, when the day-43 result surprises you, you can compare the surprise against the prior, and the size of the surprise is information. If the test result matches your priors exactly, the test was probably underpowered or the design had a hole. If the test result diverges substantially from your priors, the test is doing what you wanted the test to do — which is to update your beliefs. Pre-registration is what makes the update legible later, when somebody senior asks how the conclusion was reached.

4. Tracker stack: Voluum, Bemob, RedTrack — what to instrument before the test

Editorial illustration of a tracker stack architecture with postback inputs, CRM and database outputs, and a postback-delay timeline

A tracker is the difference between a comparison test that produces a defensible result and a comparison test that produces an anecdote. There are three serious trackers in the ad-tech performance category as of mid-2026: Voluum, Bemob, and RedTrack. Each of them is good enough to run a four-network parallel buy. Each of them has a specific configuration that the test requires and that the default install does not give you. The instrumentation work that has to happen before day zero is the most important hour of the test, and the hour that almost every team underinvests in.

The three trackers compared. Voluum is the most expensive (Pro tier from $149/month with traffic caps, Pro Bigger from $499, custom enterprise plans for buyers exceeding 30 million events monthly) and has the most mature multi-network postback templating. If you are running a parallel buy across more than three networks at any meaningful spend tier, Voluum’s pre-built integrations with the named-incumbent networks are worth the price difference. It has the cleanest sub-source disaggregation in the category — meaning, when the network reports back at the sub-publisher or sub-zone level, Voluum can reconstruct the disaggregation without manual reconciliation. Bemob is the middle option (Free tier with 100k events monthly, Standard from $49, Custom plans above). It is genuinely good and undermarketed; its postback handling is competitive with Voluum’s at a fraction of the cost. RedTrack ($83/month entry tier, $124 for the Multi-User plan, $223+ for Agency) is the budget-conscious choice and is strongest on multi-account dashboarding, which matters if the parallel buy is being run by an agency on behalf of several end-advertiser clients.

The decision rule. If the test spend exceeds $50,000 across the four networks, use Voluum. If the test spend is between $15,000 and $50,000, use Bemob. If the test spend is below $15,000, RedTrack is fine and the choice probably does not materially affect the result. The decision rule is not about the tracker’s quality per se; it is about the cost of an attribution mismatch at each spend tier. A 2% attribution discrepancy on a $50,000 buy is $1,000 of misattributed spend feeding the verdict; that justifies the Voluum premium. A 2% discrepancy on a $10,000 buy is $200, which does not.
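Written as code, the rule and the arithmetic behind it look like this. The function names are hypothetical, and the handling of the exact boundary values ($15,000 and $50,000 both fall to Bemob here) is my assumption, since the rule as stated does not say.

```python
def pick_tracker(test_spend_usd: float) -> str:
    """Tracker choice by total test spend across the four networks."""
    if test_spend_usd > 50_000:
        return "Voluum"
    if test_spend_usd >= 15_000:
        return "Bemob"
    return "RedTrack"


def discrepancy_cost(test_spend_usd: float, discrepancy_rate: float = 0.02) -> float:
    """Spend attributed to the wrong network at a given discrepancy rate."""
    return test_spend_usd * discrepancy_rate


for spend in (10_000, 30_000, 50_000, 80_000):
    print(f"${spend:,}: {pick_tracker(spend)}, "
          f"2% discrepancy = ${discrepancy_cost(spend):,.0f}")
```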

The instrumentation. Every parallel-buy test on this site uses the same instrumentation checklist. The list is short and the list is non-negotiable. Server-to-server postbacks from every network, configured before the first dollar of spend lands. Click-ID at the tracker level, propagated through the landing page, the offer page, and the CRM so that a conversion three days post-click can be traced back to the originating network and the originating sub-source. Sub-source disaggregation enabled on every network that supports it (Adsy, Adsterra, PropellerAds, RichAds, Adcash, Monetag, AdPushup all support it as of mid-2026; Mondiad’s sub-source field is at the campaign level only, which is a known limitation). Conversion deduplication enabled at the tracker, with a 24-hour deduplication window. Postback URL signing (HMAC-SHA256 if the network supports it, a secret-token query parameter if not) to prevent postback injection from any source other than the network. Timezone normalisation across all networks to UTC, recorded at the tracker, so that day-of-week analysis is consistent. Tracker-side fraud filtering on the standard MRC IVT signals (known-bot IPs, datacentre IPs, sub-100ms time-to-click). And, critically, a daily reconciliation export from the tracker to a separate analyst-controlled spreadsheet (or BigQuery, if the team has that maturity) so that the network panels are not the source of truth for the final verdict.
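To make the postback-signing item concrete, here is a minimal Python sketch of HMAC-SHA256 verification using only the standard library. The signing scheme itself is an assumption for illustration: the signature covers the query parameters sorted by name, and travels in a parameter called `sig`. Each network and tracker defines its own scheme, so the canonical string, the parameter names and the example URL would need to match whatever yours documents.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode, urlsplit


def sign_query(params: dict[str, str], secret: bytes) -> str:
    """Hex HMAC-SHA256 over the parameters, sorted by name (assumed scheme)."""
    canonical = urlencode(sorted(params.items()))
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_postback(url: str, secret: bytes, sig_param: str = "sig") -> bool:
    """True only if the URL carries a signature matching its other parameters."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    received = dict(pairs).get(sig_param, "")
    params = {k: v for k, v in pairs if k != sig_param}
    expected = sign_query(params, secret)
    # Constant-time comparison, so the check does not leak how much matched.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


# Hypothetical postback, for illustration only.
secret = b"shared-secret-agreed-with-the-network"
params = {"cid": "abc123", "payout": "12.50", "network": "network_a"}
url = "https://tracker.example/postback?" + urlencode({**params, "sig": sign_query(params, secret)})

print(verify_postback(url, secret))                            # genuine postback
print(verify_postback(url.replace("12.50", "99.00"), secret))  # tampered payout
```

A signature alone does not stop a captured postback being replayed; that is what the deduplication window on the click-ID is for.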

That last point — the daily reconciliation export — is the single instrumentation decision that most distinguishes the brand teams that produce honest test results from the brand teams that produce sponsored-content-shaped test results. The network panel is, structurally, the network’s own version of the truth. The tracker is the brand’s version of the truth. The CRM is the back-office’s version of the truth. These three numbers will not match. The day-43 verdict is read off the CRM number, not the panel number. The tracker number is the interpretation layer. The panel number is, for honest purposes, only a sanity check on whether the postbacks are firing at all.

The instrumentation checklist for Voluum specifically. Postback templates for each network installed and tested with a $1 test conversion before the test starts. Sub-source field mapping confirmed (the field name varies by network: Adsy uses pid, Adsterra uses pl, PropellerAds uses zone_id, RichAds uses subid, Monetag uses zone). Conversion-deduplication window set to 24 hours. CRM webhook installed for the brand’s actual back-office system (Shopify, HubSpot, Salesforce, custom). Click-ID parameter named cid and propagated through every landing-page redirect. Anti-fraud rules enabled at the Pro tier (the Lite tier is not sufficient). UTC timezone confirmed on the workspace. A test postback fired and confirmed at the tracker, the offer page, and the CRM, in that order, before the first dollar of real spend.

The instrumentation checklist for Bemob. Most of the same items, with two differences. Bemob’s postback handling is, in my experience, more reliable than Voluum’s on the absolute basics but less flexible on edge cases — if you have a multi-step funnel where a partial conversion fires a different postback than a full conversion, Voluum handles this more cleanly. The Bemob sub-source field is well-documented and the network-specific mapping is the same as Voluum’s. Bemob’s deduplication has historically been more aggressive than Voluum’s; the 24-hour default is sometimes too tight for slow-converting offers, and a 72-hour window is sometimes more appropriate. The fraud filtering is solid at the paid tiers; the free tier’s fraud filtering is too permissive to use for a serious test.

The instrumentation checklist for RedTrack. Same items, with the caveat that RedTrack’s strength is the dashboarding layer rather than the postback layer. The postback handling is fine — it does what it needs to do — but the configuration UI is less polished than Voluum’s, and the time-to-instrument is longer for an analyst who has not used RedTrack before. Budget two hours of analyst time on a first-time RedTrack instrument; budget thirty minutes on a first-time Voluum instrument. The difference is not capability. The difference is muscle memory.

The thing the instrumentation cannot fix. Server-to-server postbacks are reliable except when they are not. The most common postback failure is silent: the network’s postback queue accumulates over the weekend, the postbacks fire on Monday in a burst, and the tracker dedupes some of them as duplicates because the click-IDs collide within the 24-hour window. This produces a measurable conversion under-count on Monday morning. The fix is to lengthen the deduplication window to 72 hours for the first two weeks of the test, until you have confirmed the network’s actual postback cadence, then to shorten it back to 24 hours once the cadence is stable. The trade-press version of this paragraph would say “configure your tracker correctly.” The honest version is that postback cadence varies by network and varies by week and that the deduplication window has to be a tuneable parameter rather than a fixed default. Adsterra’s postback cadence is bursty on Mondays; Monetag’s is even across the week; PropellerAds’ cadence is even except for a small Saturday-afternoon dip in tier-2 GEOs. These are not defects. These are operational facts that the tracker has to be configured around.

5. Running the test — ramping budget, day-1 / day-7 / day-30 reads, statistical significance

Editorial illustration of a statistical power matrix grid shading the relationship between sample size and effect size

The first 72 hours of a parallel buy are diagnostic. The next 96 hours are the ramp. The week after the ramp is the first stable observation window. The next four weeks are the test proper. The final 48 hours are the postback drain. The verdict is read on day 43. None of these intervals is arbitrary. Each of them is calibrated against the operational facts of the named-incumbent networks’ delivery and reporting behaviour.

The first 72 hours: diagnostic and verification. During the first three days of the test, the only question is whether the test is delivering as designed. Are postbacks firing on every network? Is the tracker reconciling them to the CRM? Is the budget pacing within 20% of the planned daily pace on every network? Is the creative rotation hitting every cell in the 12-cell design? Is the GEO mix matching the planned distribution? These are operational questions, not commercial ones. The most common failure mode in the first 72 hours is a postback misconfiguration that only surfaces under volume — the test postback at $1 worked, but the production postback at 5,000 events per hour silently drops a fraction. The fix is to run a parallel manual check: on day 2, the analyst pulls a sample of 50 conversions per network from the CRM and confirms that each one has a click-ID, that each click-ID traces to a network sub-source, and that the network panel agrees within 5% on the conversion count. If any of those three checks fails, the test pauses until the instrumentation is fixed. Resist the temptation to “let it run for another day and see if it sorts itself out.” It does not sort itself out, and you lose the first week of test data to a fixable problem.
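The day-2 manual check is simple enough to script, which removes the temptation to skip it. The sketch below assumes a hypothetical CRM export in which each conversion is a dict with click_id and sub_source keys, and it measures the 5% tolerance against the CRM count; both are assumptions, not anything a specific CRM provides.

```python
import random


def day2_check(crm_conversions, panel_count, sample_size=50, tolerance=0.05, seed=0):
    """Run the three day-2 checks for one network.

    crm_conversions: list of dicts with 'click_id' and 'sub_source'
    keys (hypothetical export schema).
    panel_count: conversion count shown in the network panel.
    """
    rng = random.Random(seed)
    sample = rng.sample(crm_conversions, min(sample_size, len(crm_conversions)))

    # Check 1: every sampled conversion carries a click-ID.
    has_click_id = all(c.get("click_id") for c in sample)
    # Check 2: every sampled conversion traces to a sub-source.
    has_sub_source = all(c.get("sub_source") for c in sample)
    # Check 3: the panel count is within tolerance of the CRM count.
    crm_count = len(crm_conversions)
    panel_agrees = crm_count > 0 and abs(panel_count - crm_count) / crm_count <= tolerance

    checks = {
        "click_id": has_click_id,
        "sub_source": has_sub_source,
        "panel_within_tolerance": panel_agrees,
    }
    checks["pause_test"] = not all(checks.values())
    return checks
```

A True value for pause_test corresponds to the rule above: if any of the three checks fails, the test pauses until the instrumentation is fixed.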

The 96-hour ramp. From day 4 through day 7, the budget on each network ramps from a delivery-confirmation tier to the full planned daily rate. Most networks’ optimisation algorithms produce noisy delivery in the first 72 hours regardless of how the budget is configured — the algorithm is learning the offer, the creative, and the audience response simultaneously. Ramping the budget over 96 hours rather than starting at full rate on day 1 reduces the noise. The ramp also matters for political reasons: if a network’s account manager sees the budget jump from $0 to $5,000 per day overnight, the network will sometimes flag the account for manual review, which introduces a 24-48 hour pacing pause that throws off the test. The ramp avoids this. Start each network at 25% of the planned daily rate on day 4, ramp to 50% on day 5, 75% on day 6, and 100% on day 7. By the start of week 2, every network is at full planned rate, and the test is in steady-state.
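The ramp schedule and the 20% pacing tolerance from the diagnostic window can be written down as two small functions. This is a sketch under one stated assumption: the daily amount for the delivery-confirmation tier on days 1 to 3 is not specified in the text, so it is passed in as a parameter.

```python
# Share of the full daily rate on each ramp day, as described above.
RAMP = {4: 0.25, 5: 0.50, 6: 0.75, 7: 1.00}


def planned_daily_budget(day: int, full_rate: float, confirmation_rate: float) -> float:
    """Planned spend for one network on a given test day (1-indexed)."""
    if day < 1:
        raise ValueError("test days start at 1")
    if day <= 3:
        # Diagnostic window: delivery-confirmation tier (amount is an assumption).
        return confirmation_rate
    # Days 4-7 follow the ramp; every later day runs at the full rate.
    return full_rate * RAMP.get(day, 1.0)


def pacing_ok(actual: float, planned: float, tolerance: float = 0.20) -> bool:
    """True if actual spend is within the tolerance of the planned daily pace."""
    return abs(actual - planned) <= tolerance * planned
```

With a hypothetical full rate of $5,000 per day, planned_daily_budget(5, 5000, 200) gives the 50% step of the ramp.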

Day 7 read. This is the first formal read, and it is diagnostic only. The purpose is to confirm that the test design is intact: each network has delivered roughly 14% of its planned six-week click volume (one week out of six would be about 17%; the diagnostic and ramp days, run below full rate, account for the shortfall), the conversion rate per network is within an order of magnitude of the historical baseline, and the cell-level distribution is on track. Do not declare a winner. The day-7 conversion rates are systematically biased on every network for predictable reasons: PropellerAds’ optimisation algorithm under-delivers on tier-1 placements in the first week because the bidding model is being trained; Adsterra’s pacing is front-loaded on the first three days and back-loaded thereafter; Monetag’s day-1 click volume is inflated by a fraction of bounced impressions that the publisher panel will reverse-bill over the next 72 hours. None of these effects is scandalous. All of them are documented in the networks’ own optimisation literature. They all wash out by day 14. Read day 7 to confirm instrumentation. Do not read it for verdicts.

Day 14 informal check. Not a formal read, but a useful one. By day 14, the network algorithms have stabilised, the postback cadence is clear, and the conversion rates per network are starting to converge to their steady-state values. A quick sanity check on day 14 — are any of the networks systematically under-delivering or systematically over-delivering — flags problems that can still be fixed in time for the day-30 read to be valid. If a network is under-delivering by more than 30%, escalate to the account manager and ask whether it is a placement-availability issue or a bidding-model issue. Document the exchange in the test journal. Do not adjust the budget split.

Day 30 read. The first real read. The four networks have each delivered roughly 70% of their planned click volume (four weeks out of six, with the ramp absorbed). Conversion rates have stabilised. Sub-source distributions are clear. The five-axis framework can be scored honestly for the first time, against the framework weights for the use case. The day-30 read produces a tentative ranking — a “this is what we would conclude if the test ended today” — but does not yet have the statistical power required for the final verdict. The two-proportion z-test on the day-30 conversion rates will produce confidence intervals that overlap for the second-place and third-place networks roughly half the time, and the next two weeks of test data are what close the gap. Resist the temptation to declare the day-30 read as the verdict and turn off the bottom two networks early. Two weeks of additional data on four networks is what produces a result that survives a quarterly review meeting. Three weeks of data on the top two networks does not.
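The overlap problem at day 30 is easy to see with a few lines of arithmetic. The sketch below uses the normal-approximation (Wald) interval for a conversion rate, which is my choice for illustration and not necessarily the interval the site uses; the click and conversion counts in the comments are hypothetical.

```python
import math


def conversion_ci(conversions: int, clicks: int, z: float = 1.96):
    """Normal-approximation confidence interval for a conversion rate.

    z = 1.96 corresponds to a 95% interval.
    """
    p = conversions / clicks
    half_width = z * math.sqrt(p * (1 - p) / clicks)
    return (max(0.0, p - half_width), min(1.0, p + half_width))


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical day-30 counts: 20,000 clicks per network.
second = conversion_ci(300, 20_000)   # 1.50% conversion rate
third = conversion_ci(280, 20_000)    # 1.40% conversion rate
leader = conversion_ci(400, 20_000)   # 2.00% conversion rate
```

With these made-up counts the second-place and third-place intervals overlap while the leader's interval stands clear, which is the pattern the paragraph above describes.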

Day 42 close. The last impression-delivery day. From day 42 to day 43, the test is in postback drain — the final 24 hours of delivered impressions are converting on a delayed cadence, and the tracker is reconciling them to the CRM. Network panels will overstate the day-42 conversion count by 10-15% relative to the CRM on day 43, because the panel includes attributed conversions that have not yet been reconciled to a back-office order. This is the gap that the verdict read uses to demonstrate to senior reviewers why the CRM is the source of truth and the network panel is the sanity check.

Day 43 verdict. This is the read that the test was designed to produce. The two-proportion z-test on the day-43 conversion rates by network produces confidence intervals that, if the test was correctly powered, are non-overlapping for the top-versus-bottom comparison and have a clear ordering for the second-versus-third comparison. The five-axis framework is scored on the final dataset, weighted by use case, and produces a per-use-case verdict per network. The pre-registered hypothesis from day zero is compared against the day-43 result, and the size of the surprise is documented in the verdict write-up. If the surprise is small, the test was probably under-powered and the test should have been larger; if the surprise is large, the test was doing what it was supposed to do, which is to update the prior beliefs. The verdict is written into the prepared template (see Section 7 below), reconciled with the network account managers in a debrief call within 14 days of day 43, and published or filed depending on whether the test was for editorial purposes or for internal use.

A note on statistical significance. The two-proportion z-test is the right tool for the basic question of “which network had a higher conversion rate, and how confident can we be.” It is not sufficient for the full five-axis question, and it is not sufficient for the use-case-weighted question. The full statistical analysis uses a mixed-effects regression with network as a fixed effect, GEO and creative variant as crossed random effects, and a per-cell sample-size weighting. This is more analytically demanding than most performance-marketing teams have in-house, and the practical compromise is a stratified two-proportion test by GEO bucket, with Bonferroni correction across the six pairwise comparisons among the four networks (alpha 0.05 / 6 = 0.0083 corrected). The Bonferroni correction is conservative. The mixed-effects regression is the right answer when it is feasible. The stratified z-test is the right answer when it is not.
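For readers who want the compromise version in runnable form, here is a sketch of the pooled two-proportion z-test applied to every pair of networks, with a Bonferroni-corrected threshold. It handles a single GEO bucket; running it once per bucket is the stratification step, which is left out here. The function names and input shape are assumptions, and only the standard library is used.

```python
import math
from itertools import combinations


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def pairwise_bonferroni(results, alpha=0.05):
    """results: dict of network label -> (conversions, clicks).

    Returns the corrected threshold and, per pair,
    (p_value, significant_at_corrected_threshold).
    """
    pairs = list(combinations(sorted(results), 2))
    corrected = alpha / len(pairs)  # four networks give six pairs
    outcome = {}
    for a, b in pairs:
        _, p = two_proportion_z(*results[a], *results[b])
        outcome[(a, b)] = (p, p < corrected)
    return corrected, outcome


# Hypothetical counts for four anonymised networks in one GEO bucket.
counts = {"A": (400, 20_000), "B": (300, 20_000), "C": (290, 20_000), "D": (280, 20_000)}
threshold, verdicts = pairwise_bonferroni(counts)
```

Passing alpha=0.10 gives the looser threshold discussed in the next note for internal buying decisions.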

A second note on statistical significance. The conventional alpha threshold of 0.05 is, in this category, more conservative than the practical situation warrants. A 90% confidence interval (alpha 0.10) is sufficient for an internal buying decision, because the decision is binary (the network either gets the bulk of the production buy or does not) and the cost of a false-positive is not large. A 95% threshold is the standard for editorial publication on this site, because the cost of a false-positive editorial claim is reputational. Choose the threshold deliberately and disclose it in the verdict.

6. Common methodological errors — postback delay, sub-source unblending, day-1 bias

Editorial illustration of a checklist of six common methodological errors on cream paper with mixed tick states

The errors below are the ones that invalidate parallel-buy comparison tests most often. They are not all the errors that can occur; they are the errors I have personally caught in either consultancy review or trade-press editing, sometimes multiple times in the same review. None of them is exotic. All of them are operational. Most of them persist in trade-press comparisons because nobody writes them up; in particular, network sales decks tend not to mention them, for the obvious reason. Read this section against the test design you intend to run. If the test design does not address each item below, the design has a hole.

Postback delay. The single most common source of misleading day-7 conversion counts. Different networks have materially different postback latency profiles. PropellerAds’ postbacks fire roughly 6 hours after the conversion event, on average, with a long tail to 36 hours. Adsterra’s fire roughly 4 hours on average with a fatter tail to 72 hours on weekends. Monetag’s are tight — 2 hours on average, with a 24-hour tail. Adsy and AdPushup operate close to real-time (under 30 minutes for 95% of conversions). The day-7 read, taken naively, will systematically under-count Adsterra’s conversions relative to Monetag’s, because the Adsterra weekend tail has not yet drained by the time the read is taken. The fix is to lengthen the tracker’s deduplication window during the first two weeks (see Section 4), and to read day-7 with explicit awareness of which networks have not yet drained. Do not normalise across networks on day 7. By day 14, the cadence is stable and the comparison is honest.

Sub-source unblending. This is the error where the parallel buy treats each network as a single homogeneous traffic source, when in fact each network’s traffic is heavily heterogeneous across sub-publishers, sub-zones, and sub-formats. A network can have a network-level average conversion rate of 1.4% that is composed of a 90th-percentile sub-source at 3.8% and a 10th-percentile sub-source at 0.2%. If the parallel-buy test allows each network’s optimisation algorithm to find its best sub-sources during the test — which is the default behaviour — then the comparison is between the optimised sub-source mix of each network, not between the gross traffic distribution. This is sometimes the right comparison (the buyer is going to use the optimisation in production), but it is sometimes the wrong comparison (the buyer wants to know the un-optimised baseline because the production buy will run for two years and the sub-source mix will turn over substantially). The fix is to disclose, in the verdict, what the test compared: optimised mix versus un-optimised mix versus a forced uniform distribution across all sub-sources. The trade-press shortcut is to not mention this, in which case the reader cannot tell which question was answered.

Day-1 bias. The opposite of the postback-delay problem and equally common. Networks frequently allocate house traffic — or fresh-publisher traffic that has not been pre-allocated to existing campaigns — to new advertisers in the first 48 hours. The result is a day-1 and day-2 conversion-rate read that is systematically higher than the steady-state. The trade-press shortcut is to read days 1 and 2 as if they were typical, declare a winner, and turn off the other networks. The fix is to exclude days 1 through 3 from the formal read (the diagnostic window in Section 5), and to start the actual conversion-rate calculation on day 4. The networks’ algorithms do not transition cleanly on day 4 — there is a continuous distribution of optimisation maturation — but day 4 is the conventional cut-off and it is good enough for a stratified test.
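The day-4 cut-off is a one-line filter in practice. The sketch below assumes a hypothetical per-day export of clicks and conversions for one network; the numbers in the example are invented to show the size of the effect, not measured.

```python
def conversion_rate(daily, first_day=1):
    """Conversion rate over all days >= first_day.

    daily: dict of test day -> (clicks, conversions), a hypothetical export shape.
    """
    clicks = sum(c for day, (c, _) in daily.items() if day >= first_day)
    conversions = sum(v for day, (_, v) in daily.items() if day >= first_day)
    return conversions / clicks if clicks else float("nan")


# Invented figures: days 1-2 inflated by fresh-publisher traffic.
daily = {
    1: (1000, 30), 2: (1000, 28), 3: (1000, 20),
    4: (1000, 14), 5: (1000, 15), 6: (1000, 13),
}
naive_rate = conversion_rate(daily)                 # includes days 1-3
formal_rate = conversion_rate(daily, first_day=4)   # the formal read starts on day 4
```

In this made-up case the naive read is 2.0% and the formal read is 1.4%, a gap large enough to reorder two closely matched networks.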

Creative fatigue confounds. A six-week test is long enough that creative fatigue starts to bite, particularly on push and popunder formats. If the creative rotation is not aggressive enough, the conversion rates will decay across the six weeks in a way that is real but is a property of the offer-creative pairing rather than of the network. The fix is to rotate the creative across at least three variants on each network, on the same rotation schedule, so that creative fatigue is a controlled confound rather than a network-attributable effect. If you cannot rotate three creatives — because the creative team only produced one — then the test is testing the offer-creative pairing more than the network, and the verdict should disclose that.

GEO mix drift. Most networks’ optimisation algorithms will, given freedom, shift the GEO mix during the test toward whichever GEOs are converting best. This is rational behaviour for the network and an invalidating behaviour for the test. The fix is to lock the GEO mix at the campaign level — explicit GEO targeting with capped daily delivery per GEO — so that all four networks are buying the same GEO mix across the test. This is a routine campaign-management discipline. It is also the most common discipline that is silently relaxed mid-test by a well-meaning account manager who is trying to “optimise toward the converting GEOs.” Catch this in the day-7 diagnostic and re-lock the targeting.

Account-manager intervention. The least discussed source of test invalidation. Most networks assign a dedicated account manager to a test buy, and a competent account manager will, during the test, suggest creative tweaks, bid adjustments, targeting refinements, and pacing changes. Each of these suggestions, individually, is reasonable. The problem is asymmetry: not every network’s account manager will intervene equally, and the networks whose account managers intervene more will appear to perform better. The fix is a written policy, communicated to the account managers before the test starts, that no campaign changes are accepted during the six-week test window other than the pre-agreed budget pacing. The account managers will not love this. It is, nevertheless, the discipline that makes the comparison honest.

Attribution-window mismatch. Each network defaults to a different attribution window: Adsterra defaults to 30-day click, Monetag defaults to 7-day click, PropellerAds defaults to 30-day click with a 7-day view-through, Adsy defaults to 24-hour click. If the test reads each network’s panel using each network’s default attribution window, the comparison is mathematically incoherent. The fix is to normalise the attribution window at the tracker (24-hour click is the most defensible default for performance offers, 7-day click for offers with a longer consideration cycle) and to report the tracker’s number rather than the panel’s number. The panel-level numbers will diverge from the tracker by 5-20%, depending on the network, and this divergence is expected: it is the measurable effect of the differing default windows, not an instrumentation fault.
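Normalising the window means applying one rule to every conversion regardless of which network delivered the click. The sketch below assumes a hypothetical tracker export with clicked_at and converted_at timestamps; trackers usually expose this as a setting, so the code is only meant to make the rule explicit.

```python
from datetime import datetime, timedelta


def within_window(clicked_at: datetime, converted_at: datetime, window: timedelta) -> bool:
    """True if the conversion happened inside the click attribution window."""
    delay = converted_at - clicked_at
    return timedelta(0) <= delay <= window


def normalise(conversions, window=timedelta(hours=24)):
    """Keep only conversions inside the same window for every network."""
    return [
        c for c in conversions
        if within_window(c["clicked_at"], c["converted_at"], window)
    ]


# Hypothetical records: same click, one fast and one slow conversion.
click = datetime(2026, 5, 12, 9, 0)
sample = [
    {"network": "Adsterra", "clicked_at": click, "converted_at": click + timedelta(hours=3)},
    {"network": "Adsterra", "clicked_at": click, "converted_at": click + timedelta(days=5)},
]
counted_24h = normalise(sample)
counted_7d = normalise(sample, window=timedelta(days=7))
```

The slow conversion is counted under a 7-day window and dropped under a 24-hour one, which is the kind of gap that opens between a network panel and the tracker.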

Network-side fraud filtering inconsistency. Networks differ in how aggressively they pre-filter invalid traffic before billing. Some networks bill against gross delivered impressions and refund the IVT after the fact; some bill against net delivered impressions with IVT already excluded. The day-30 conversion-rate read will systematically overstate the conversion rate on networks that bill net, and understate it on networks that bill gross, unless the test normalises against gross delivered impressions on both sides. The fix is to pull the gross-delivered-impression count from the network panel separately from the billed-impression count, and to use gross as the denominator for the conversion-rate calculation.

Currency and reconciliation errors. Networks bill in different currencies and reconcile on different schedules. A six-week test that runs from May 12 to June 22 will span at least one billing cycle on every network and at least one currency-conversion cycle if the spend currency differs from the network’s billing currency. The fix is trivial — record every billing event in the analyst’s spreadsheet in both the billing currency and the analyst’s base currency at the day-of-event exchange rate — and the failure mode is that nobody records the FX rate, and the final spend reconciliation is 2-3% off across the four networks. Two per cent of $50,000 is $1,000, which is enough to flip a marginal verdict.
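The spreadsheet discipline amounts to never storing an amount without the rate that converts it. A minimal sketch of that record follows; the class name, the field names and the exchange rates are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BillingEvent:
    network: str
    billing_currency: str
    amount: Decimal           # amount in the network's billing currency
    fx_rate_to_base: Decimal  # day-of-event rate: base units per 1 billing unit

    @property
    def amount_base(self) -> Decimal:
        """Amount in the analyst's base currency, rounded to two places."""
        return (self.amount * self.fx_rate_to_base).quantize(Decimal("0.01"))


def total_spend_base(events, network: str) -> Decimal:
    """Total spend for one network in the base currency."""
    return sum((e.amount_base for e in events if e.network == network), Decimal("0"))


# Invented events and rates, for illustration only.
ledger = [
    BillingEvent("Adsterra", "USD", Decimal("1000"), Decimal("0.79")),
    BillingEvent("Adsterra", "USD", Decimal("500"), Decimal("0.80")),
    BillingEvent("Monetag", "USD", Decimal("1500"), Decimal("0.79")),
]
adsterra_total = total_spend_base(ledger, "Adsterra")
```

Decimal is used instead of float so that the reconciliation does not pick up rounding noise of its own.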

The “best network on day 30” trap. The final and most important error. If the test produces a different ranking on day 30 versus day 43, the day-43 ranking is the one that matters, because the day-30 ranking is a snapshot taken before the postback drain has completed and before the slower-converting offers have fully cycled. Brand teams under quarterly-review pressure are sometimes tempted to “lock in” the day-30 result early. Do not. The marginal cost of running the test to day 43 is, by definition, 30% of the original test budget, which is the same 30% you have already committed. Run the test to the end. The verdict that survives a year is the day-43 verdict.

7. Reading the result — how to write a verdict that survives a year

Editorial illustration of a verdict template laid flat on a slate-grey desk with a fountain pen and reference card

The verdict is the cheapest part of the test and the part the trade press spends the most time on. The expensive parts — the design, the instrumentation, the running, the error-correction — are upstream. The verdict is what gets written, published, and read; it is also what, if written badly, makes the upstream work invisible to the reader. A verdict that survives a year is a verdict that, twelve months from the publication date, reads as a fair characterisation of the test against the use case it was scoped against. Most trade-press verdicts do not survive six months, not because the networks have changed but because the verdict was written too broadly. The fix is a structured verdict template, prepared on day zero, that forces the reviewer to make claims at the level of the use case and not at the level of the universal category.

The verdict template, in the form used on this site, has six sections.

Section one is the use-case context. One paragraph that names the advertiser profile, the GEO mix, the vertical, the spend tier, the attribution window, and the duration. The point of the use-case context is to bound the verdict’s portability. If the verdict is read by a Series B nutra advertiser in tier-1 western Europe spending $30,000 a month, the verdict is in-scope; if it is read by a tier-2 publisher in Indonesia, the verdict is out of scope. The use-case context is the disclaimer that makes this explicit.

Section two is the headline verdict. Two to four sentences, no more. “For X advertiser doing Y in Z region, my recommendation is _. The runner-up is _. The exact tradeoff between them is _.” This is the section that gets quoted; it is also the section that gets quoted out of context. The fix for the quotation problem is that the headline verdict always names the use-case context inline, even at the cost of feeling repetitive. “For a Series B DTC nutra advertiser spending £20,000–£40,000 monthly into Tier-1 western European GEOs, Adsy is the recommendation; Mondiad is the runner-up; the tradeoff is transparency-versus-fill-rate” reads slightly clunky in isolation and stops the quotation from leaving the context behind.

Section three is the per-axis scoring. Each of the five framework axes — CPM floor, format breadth, GEO depth, anti-fraud, payment terms — gets a one-to-five score per network, with one sentence of justification per score. The justification sentence cites the specific test data that produced the score: “Adsy scored 4 on CPM floor because the eCPM was within 8% of the rate-card claim at the $30,000 monthly spend tier across all four GEOs in the test; the comparable figures for the other three networks were 22%, 31%, and 14% above rate card.” This is the section that the named-incumbent networks will dispute, and that is fine — the dispute is the value the test adds.

Section four is the use-case-weighted scoring. The same five axes, weighted per the use-case profile, summed into a single per-network score. The weighted scoring is what produces the headline verdict; it is also the section that makes the headline verdict reproducible if a third party wants to reweight for a different use case. Publish the weights.
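The weighted sum is trivial arithmetic, and writing it out is what makes the reweighting reproducible. In the sketch below the five axis names follow the framework, while the weights and scores are invented for illustration and are not results from any test.

```python
AXES = ("cpm_floor", "format_breadth", "geo_depth", "anti_fraud", "payment_terms")


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted sum of one-to-five axis scores; weights must sum to 1."""
    if set(scores) != set(AXES) or set(weights) != set(AXES):
        raise ValueError("scores and weights must cover exactly the five axes")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[axis] * weights[axis] for axis in AXES)


# Hypothetical use-case weights and per-axis scores for one network.
weights = {"cpm_floor": 0.3, "format_breadth": 0.1, "geo_depth": 0.3,
           "anti_fraud": 0.2, "payment_terms": 0.1}
scores = {"cpm_floor": 4, "format_breadth": 3, "geo_depth": 5,
          "anti_fraud": 4, "payment_terms": 2}
total = weighted_score(scores, weights)
```

A third party who disagrees with the weights can swap in their own dictionary and rerun the ranking, which is the reproducibility argument for publishing them.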

Section five is the surprises. A short paragraph naming the day-zero pre-registered hypothesis and the day-43 result, with the surprises called out. “I expected PropellerAds to win on GEO depth in tier-1; the test result shows PropellerAds tied with RichAds, both clearly behind Adsy, which was a surprise. The reason, on the post-hoc analysis, appears to be that PropellerAds’ tier-1 publisher base has shifted toward second-tier inventory in the last 18 months.” The point of this section is that the surprises are the most useful information in the test, and pretending the result matched expectations is a missed opportunity for the reader.

Section six is the “skip this if” / “don’t use for” paragraph per network. The most important paragraph in the verdict, and the one almost no comparison piece on the open web provides. “Skip Adsterra if your offer needs more than five seconds of consideration before the click. Skip PropellerAds self-serve if you need a dedicated account manager below $10,000 monthly spend. Skip Monetag if your GEO mix is more than 60% tier-1 — better economics exist on AdPushup at that volume.” This is the section that gets a verdict cited in a year’s time, because it is the section that survives the network-feature changes that happen in the intervening twelve months. Networks change pricing, formats, and dashboards on a six-month cycle; the “don’t use for” criteria change much more slowly because they are properties of the network’s positioning, not its product release schedule.

A footnote on the verdict template: the template is iterated, lightly, every quarter on this site. The version I use in mid-2026 has six sections; the version I used at the consultancy in 2023 had four; the version that lands in the editorial workflow in 2027 will probably have seven, because I want to add a “monitoring trigger” section that names the conditions under which the verdict should be re-tested. The principle is constant. The implementation is iterative.

The other thing the verdict does, when it works, is force the reviewer to be specific in the places the reviewer is tempted to be general. The headline verdict does not say “Adsy is the best native network.” It says “Adsy is the best native network for a Series B DTC advertiser spending £20,000–£40,000 monthly into tier-1 western European GEOs.” The per-axis scoring does not say “Adsy is transparent.” It says “Adsy scored 4 of 5 on CPM floor because the transacted eCPM was within 8% of the rate-card claim at the tested spend tier.” The skip-this-if section does not say “Adsy may not suit every advertiser.” It says “Skip Adsy if your offer relies on tier-2 LATAM volume; Adsy’s tier-2 LATAM fill rate is roughly half of Monetag’s at comparable spend tiers.” Each specificity is a year of reviewer-discipline distilled into a template. It is the discipline the trade-press category is allergic to. It is also the discipline that makes a verdict useful to the reader twelve months out.

8. FAQ — ten questions

1. How many networks should I test in parallel — three, four, or five?

Four is the standard recommendation. Three is too few to produce a useful ranking; the variance between two networks on a single test is too high to draw confident conclusions about a candidate set of three. Five is operationally manageable but adds disproportionate analyst overhead — five-network designs require either a larger statistical-power budget or a tighter alpha threshold to avoid false positives across the five-network comparison set, and the marginal information from the fifth network is rarely worth the cost. Four is the sweet spot between statistical power, operational complexity, and per-network sample size at a $40,000–$60,000 total test budget. If the test budget is below $25,000, drop to three networks; if the budget is above $100,000, five networks is defensible.

2. Can I test fewer than 12 cells if my team does not have the analyst capacity?

You can, and the result will be weaker. The minimum defensible cell count is six — three creative variants × two GEO buckets, with day-parting and landing-page held constant. Below six cells the main-effect estimate is so noisy that the four-network ranking will not survive replication. The compromise version of the compromise version is to run the test on six cells, accept that the verdict will be approximate, and disclose the cell count in the methodology section. Disclosure is the move that distinguishes an approximate verdict from a misleading verdict. Six cells with disclosure is more honest than 72 cells without.

3. What is the minimum spend per network to make the test meaningful?

The minimum useful spend per network is the spend at which the network treats the buy as a real production buy rather than a trial. That threshold varies by network, but the rough rule is $2,500 to $5,000 monthly equivalent for tier-1 networks (Adsy, AdPushup, Mondiad), $1,000 to $2,500 for mid-tier (RichAds, Adcash), and $500 to $1,000 for tier-2 (Adsterra self-serve, Monetag self-serve, Clickadu, ExoClick). Below these thresholds the network’s account-management treatment of the test differs from the production treatment, and the test result does not generalise. The right way to think about the budget is: the lowest spend tier at which the network would assign you a dedicated account manager and treat you as a real customer is the lowest tier at which the test produces a verdict that matches production behaviour.

4. How do I handle differing minimum-deposit thresholds across the four networks?

Most named-incumbent networks have minimum-deposit thresholds in the $100 to $1,000 range; a few of the premium networks operate at $5,000 to $10,000 minimums for new advertisers. If the thresholds differ across the four candidate networks, the practical compromise is to fund each network’s deposit at its individual minimum and to set the per-network test spend at the higher of the largest deposit minimum among the four and the planned per-network allocation. In a four-network test where Adsy’s minimum is $10,000 and Adsterra’s is $100, the test runs against $10,000 on Adsy and $10,000 on Adsterra anyway, because the test design holds the per-network spend constant. The deposit minimum is a fact of the network’s commercial posture. It is not a tax on the test design.

5. What attribution window should the test use — 24-hour click, 7-day click, or 30-day click?

Use the attribution window that matches the underlying offer’s consideration cycle. For DTC e-commerce offers under $100 average order value, 24-hour click is appropriate. For DTC e-commerce offers above $200 AOV, 7-day click is appropriate. For lead-gen offers with a 14-30 day sales cycle, 30-day click is appropriate, and the test duration should be extended to nine weeks to clear the full attribution window. The single most important point is that the window must be the same across all four networks at the tracker level. The networks will each default to a different window; the tracker normalises them; the verdict uses the tracker’s normalised number.

6. How do I structure the day-zero pre-registered hypothesis?

Write down, in plain prose, what you expect the day-43 result to be on each of the five framework axes for each of the four networks. Predict the per-axis ranking; predict the use-case-weighted ranking; predict the surprises. Save the document with a date-stamped commit (Git, Dropbox, email-to-self — any tamper-evident store will do). On day 43, open the pre-registration and compare. The pre-registration is not graded. It does not affect the verdict. It exists to make the size of the surprise legible to a future reader and to discipline the reviewer against post-hoc rationalisation.

7. The test result was inconclusive. What do I do?

If the day-43 result produces overlapping confidence intervals for the top-versus-second comparison, the honest verdict is “we cannot distinguish between Network A and Network B at this spend tier and this duration.” That is a legitimate verdict and one the trade press almost never publishes. The next step is to run a follow-up test at a higher spend tier or a longer duration, or to declare the choice between A and B on non-quantitative grounds (account-manager fit, integration support, payment terms — Section 7’s skip-this-if criteria). What you do not do is round the inconclusive result up to a verdict. Inconclusive is information. Pretending it is conclusive is the trade-press shortcut.

8. How often should I re-test the same set of networks?

The networks change features, pricing, and policies on a six-month cycle on average. A re-test is justified every nine to twelve months for a single use case, with the gap shrinking to six months for fast-moving verticals (iGaming, crypto) and stretching to eighteen months for slow-moving verticals (telco-billing, certain B2B SaaS). The re-test does not need to be a full four-network parallel buy each time; a “spot check” — a single network re-tested against the previous verdict — is sufficient if no major network features have shifted in the interval.

9. The account manager wants to “optimise” the campaign mid-test. How do I say no politely?

The standard line, communicated to the account manager before the test starts: “We are running a methodology-controlled test across four networks. We are holding all parameters constant for six weeks to produce a clean comparison. We will be happy to discuss optimisations after the test concludes.” Most account managers respect this. The ones who push hardest are typically the account managers at networks that perform less well in the parallel-buy design because the optimisation is what closes the gap with the better-performing networks. That observation is itself useful test data and worth recording.

10. Should the test be blinded, so that the analyst does not know which network is which during the day-30 read?

Yes. The 2022 Marketing Science replication on vendor-evaluation blinding found 0.7-sigma rating inflation when reviewers knew the vendor identity. Blinding the analyst’s day-30 read against the network labels, by anonymising the network names as A, B, C, D in the tracker dashboard for the duration of the test, produces more honest scoring on the five-axis framework. The reveal happens on day 43, after the verdict has been provisionally written against the anonymised data. Blinding is operationally annoying. It is also the methodology refinement that produces the largest reliability gain per hour of analyst time, and it is the refinement that almost no trade-press comparison test bothers with.

Frequently asked questions

Why are most “best ad network” lists unreliable?

Not because the author is lying about who paid them, but because trade-publishing economics have spent fifteen years selecting for writers who can produce comparison content that does not embarrass any sponsoring network. The winning format under those economics is the “Top 10” where every network wins on something, and almost every comparison on the open web is that piece. The fix is to stop reading rankings and run your own parallel buy, the only methodology that survives a quarterly review meeting.

What is a parallel-buy test and why does it beat a sequential trial?

A parallel buy runs four candidate networks against each other simultaneously, on the same offers, same creative rotation and same GEO mix, with budget split evenly over six weeks. The sequential single-network trial — four networks tested in series over six months, judged by the last one tested — is structurally rigged in favour of recency and whichever account manager followed up most politely. The networks that win a parallel buy are usually not the ones that win the sequential trial, and that gap is the single most important observation in the category.

What are the five axes for comparing ad networks?

CPM floor, format breadth, GEO depth, anti-fraud posture and payment terms. Each is scored one to five per network and weighted by use case rather than averaged into a composite — a composite is exactly what produces the universalist rankings the category is rightly criticised for. A Series B DTC advertiser weights CPM floor and anti-fraud heavily; a Tier-2 publisher weights payment terms and minimum payout heavily. The weights move with the profile, and so does the verdict.
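To make the "weights move with the profile" point concrete, here is a minimal sketch of per-profile scoring. The axis names come from the framework above; the scores, the weight values and the two placeholder networks are invented for illustration, and minimum payout is folded into payment terms as an assumption.

```python
AXES = ("cpm_floor", "format_breadth", "geo_depth", "anti_fraud", "payment_terms")


def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Score one network for one advertiser or publisher profile.

    scores: one-to-five rating per axis; weights: relative importance per axis.
    """
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score must be between 1 and 5")
        if weights[axis] < 0:
            raise ValueError(f"{axis} weight must be non-negative")
    total_weight = sum(weights[axis] for axis in AXES)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(scores[axis] * weights[axis] for axis in AXES) / total_weight


# Hypothetical scores for two placeholder networks.
network_x = {"cpm_floor": 5, "format_breadth": 3, "geo_depth": 3, "anti_fraud": 4, "payment_terms": 2}
network_y = {"cpm_floor": 2, "format_breadth": 3, "geo_depth": 3, "anti_fraud": 3, "payment_terms": 5}

# Hypothetical weights for the two profiles named in the text.
dtc_weights = {"cpm_floor": 0.35, "format_breadth": 0.10, "geo_depth": 0.10, "anti_fraud": 0.35, "payment_terms": 0.10}
publisher_weights = {"cpm_floor": 0.10, "format_breadth": 0.10, "geo_depth": 0.10, "anti_fraud": 0.10, "payment_terms": 0.60}

for profile, weights in (("DTC advertiser", dtc_weights), ("Tier-2 publisher", publisher_weights)):
    ranked = sorted(
        (("network_x", weighted_score(network_x, weights)), ("network_y", weighted_score(network_y, weights))),
        key=lambda pair: pair[1],
        reverse=True,
    )
    print(profile, ranked)
```

With these made-up numbers the same two networks swap places between the two profiles. The function deliberately takes the weights as an argument and has no default, so a profile-free composite cannot be produced by accident.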

How big does the sample need to be for a defensible result?

Larger than the industry habit of reading a winner off fifty conversions. For a 2% baseline conversion rate, alpha 0.05 and power 0.80, detecting a 10% relative effect needs roughly 17,000 clicks per network — about 68,000 across a four-network test. The trade-press shortcut runs the test on a few thousand dollars of spend, declares a winner and never mentions the confidence interval. That shortcut is precisely why the category appears to have no signal.
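The power calculation is worth running yourself with your own baseline rather than taking any quoted figure on trust. The sketch below assumes the textbook normal-approximation formula for a two-sided comparison of two proportions with unpooled variance; the function name is hypothetical. That formula is more conservative than the figure quoted above: for the same inputs it returns roughly 80,000 clicks per network, so check which test your tracker or calculator assumes before fixing the budget.

```python
from math import ceil
from statistics import NormalDist


def clicks_per_network(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Clicks needed per network to detect a relative lift in conversion rate.

    Two-sided z-test for two proportions, normal approximation, unpooled variance.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if p2 >= 1:
        raise ValueError("lifted rate must stay below 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


per_network = clicks_per_network(baseline_rate=0.02, relative_lift=0.10)
print(per_network, per_network * 4)  # per network, and across a four-network test
```

The requirement scales with the inverse square of the effect size, so relaxing the detectable lift from 10% to 20% cuts the clicks needed to roughly a quarter.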

Which tracker should I use — Voluum, Bemob or RedTrack?

By spend tier. Above $50,000 across the four networks, use Voluum for its mature multi-network postback templating and sub-source disaggregation. Between $15,000 and $50,000, Bemob is genuinely good and undermarketed. Below $15,000, RedTrack is fine and the choice probably does not move the result. The decision is about the cost of an attribution mismatch at each spend tier, not about raw tracker quality — and whichever you pick, the verdict reads off the CRM, not the network panel.

When should I read the result and declare a winner?

Day 43, not day 7 and not day 30. The day-7 read is diagnostic only — it confirms postbacks are firing and budget is pacing, and almost everything you learn about the networks in the first seven days is wrong. Day 30 is the early signal but the confidence intervals for second versus third place still overlap about half the time. The day-43 verdict, after the postback drain completes, is the read the test was designed to produce, and it is the one that survives a year.


Methodology appendix. This article is editorial, not the report of a single live test; the methodology described is the one used across this site’s review programme during the four parallel-buy tests run in Q1 and Q2 2026: the popunder tier-1 test published in March, the native tier-1 test published in April, the push tier-2 LATAM test published in May, and the in-page push tier-2 SEA test, which is currently in the day-30 window. Total test spend across the four reviews: $187,400. Networks tested (not all in every test): Adsy, Adsterra, PropellerAds, RichAds, Monetag, Mondiad, Adcash, AdPushup, ExoClick, Clickadu. Tracking: Voluum for two tests, Bemob for two. Conflict-of-interest disclosure: this site holds no paid placement relationship with any of the named networks; affiliate-link relationships, where they exist, are disclosed on individual review pages above the methodology paragraph rather than in a footer. If the methodology has a hole, the address in the footer reaches the editor.
