Schema Design for Link Graphs in OLAP: Fast PageRank-ish Metrics at Scale


ccrawl
2026-02-12
10 min read

Design link-graph schemas and ETL for ClickHouse to compute fast PageRank-ish metrics, domain authority, and scalable backlink analytics.

If you run large-scale crawls and still can’t answer basic questions — which pages have the most backlinks, what changed since yesterday, or whether a domain’s authority rose after a link cleanup — you’re suffering from a data modeling problem, not a crawling problem. Storing the web’s link graph in a traditional row-store or a graph DB works for small datasets, but it breaks when you want fast analytics, iterative algorithms (PageRank-ish scores), and low-latency joins against other SEO signals at web scale.

In 2026, many teams are moving their link graphs into OLAP systems (ClickHouse at the forefront) to get both scale and SQL composability. This guide covers practical schemas, ETL patterns, and iteration strategies for computing backlink metrics, domain authority approximations, and PageRank-style scores in ClickHouse, with real-world trade-offs and code snippets you can adapt today.

OLAP engines improved dramatically in late 2024–2025 and into 2026: faster vectorized execution, better distributed joins, user-defined functions, and better support for iterative workflows. ClickHouse’s continued investment and ecosystem momentum (notably new funding and product velocity in 2025) made it a practical choice for teams that need high-throughput ingestion and low-latency analytical queries on billions of edges.

The practical consequence: instead of a special-purpose graph store, you can run cheap, parallelizable PageRank-ish computations inside your OLAP stack and combine link signals with crawl metadata, content scores, and crawl frequency directly in SQL and materialized views.

A workable design separates five concerns:

  • Write-optimized raw capture — append-only raw edges from crawls, partitioned for fast ingest, with TTL for stale snapshots.
  • Compact aggregated edges — a deduped, aggregated view of edges with counts and degrees for fast joins.
  • Node registry — mapping from URL to normalized node id (URL-level) and host-level ids for domain aggregation.
  • Fast iterative PR-ish — an execution path to perform many iterations over the graph with bounded memory and I/O.
  • Incremental updates — small daily delta processing rather than recomputing everything from scratch.

We'll present a minimal, pragmatic schema: raw_edges, edges_agg, nodes (URL->node_id), domains (node_id->domain_id), and pr_scores (iterative). This is deliberately normalized for storage cost and update performance.

1) raw_edges — append-only capture from your crawler

Purpose: capture every outgoing link extracted by crawls (with metadata). Keep it write-optimized and cheap.

CREATE TABLE raw_edges (
  crawl_date Date,
  from_url String,
  to_url String,
  http_status UInt16,
  anchor String,
  rel String,
  is_nofollow UInt8,
  fetch_id String -- optional unique crawl id
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(crawl_date)
ORDER BY (from_url, to_url, crawl_date);

Notes: partition by month for retention and easier TTL. Store raw URLs (not integer ids) at ingest time to reduce ETL friction; resolve ids in a downstream normalization pass.
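
A retention rule can be attached directly to the table. A minimal sketch, assuming roughly six months of raw snapshots is enough for your use case:

ALTER TABLE raw_edges MODIFY TTL crawl_date + INTERVAL 6 MONTH;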

2) nodes — normalized URL -> node_id mapping

Purpose: keep a compact integer id per distinct canonical URL for joins and compact edge storage.

CREATE TABLE nodes (
  node_id UInt64,
  url String,
  canonical_url String,
  last_seen DateTime,
  url_hash UInt64
) ENGINE = ReplacingMergeTree(last_seen)
PARTITION BY toYYYYMM(last_seen)
ORDER BY (url_hash, node_id);

Populate node_id by hashing canonical_url (e.g., sipHash64 or murmurHash2_64) or by using an id generator in your ETL; the same hash works for url_hash and helps with partitioning and dedupe. Keep last_seen so unused nodes can be expired with TTL.
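
A minimal sketch of the normalization pass in SQL, assuming a hypothetical staging table canon_urls_staging with (url, canonical_url) rows produced by the canonicalization step:

-- canon_urls_staging is a hypothetical staging table from the canonicalization step
INSERT INTO nodes (node_id, url, canonical_url, last_seen, url_hash)
SELECT
  sipHash64(canonical_url) AS node_id,
  url,
  canonical_url,
  now() AS last_seen,
  sipHash64(url) AS url_hash
FROM canon_urls_staging;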

3) edges_agg — deduped and aggregated edges (compact graph)

Purpose: store the condensed graph: one row per (from_id, to_id) with counts and precomputed degrees.

CREATE TABLE edges_agg (
  from_id UInt64,
  to_id UInt64,
  out_count UInt32 DEFAULT 1,
  last_updated DateTime
) ENGINE = SummingMergeTree(out_count)
PARTITION BY from_id % 16
ORDER BY (from_id, to_id);

SummingMergeTree accumulates out_count for repeated edges at merge time; naming the summed column explicitly leaves last_updated alone. Partitioning by a modulo of the id (from_id % N) keeps the partition count bounded and spreads data evenly whether ids are hashes or sequential.
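
Because SummingMergeTree folds rows only during background merges, read the table with an explicit aggregation (or after OPTIMIZE ... FINAL) to get fully summed counts. A sketch:

-- Aggregate on read: partial, not-yet-merged rows may still exist
SELECT from_id, to_id, sum(out_count) AS out_count
FROM edges_agg
GROUP BY from_id, to_id;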

4) domains — collapse node_id -> domain_id (host / eTLD+1)

Domain-level aggregation is essential for domain authority approximations — collapsing tens of billions of URL nodes into tens of millions of hosts.

CREATE TABLE domains (
  node_id UInt64,
  domain_id UInt32,
  host String,
  last_seen DateTime
) ENGINE = ReplacingMergeTree(last_seen)
ORDER BY node_id;

5) pr_scores — store iterative PageRank-ish scores

We store both URL-level and domain-level scores using the same pattern.

CREATE TABLE pr_scores (
  node_id UInt64,
  score Float64,
  iteration UInt32,
  updated_at DateTime
) ENGINE = MergeTree()
PARTITION BY iteration
ORDER BY node_id;

ETL patterns: from crawler to compact graph

High-level steps:

  1. Stream raw link tuples into raw_edges (fast inserts).
  2. Canonicalize URLs and resolve redirects to canonical_url; write/merge into nodes table.
  3. Map raw_edges(from_url,to_url) to (from_id,to_id) using nodes; write to a staging edges table of ids.
  4. Materialize or insert into edges_agg via aggregation (SummingMergeTree, or a materialized view feeding an AggregatingMergeTree target).
  5. Compute domain mapping and domain-level edges with GROUP BY domain pairs for domain authority.

Practical ETL snippet (Python + clickhouse-driver)

# Pseudo-Python outline using clickhouse-driver
from clickhouse_driver import Client

client = Client('clickhouse-host')  # adjust host/credentials for your cluster

# 1. Insert raw edges in batches
client.execute('INSERT INTO raw_edges (crawl_date, from_url, to_url, http_status) VALUES', rows)

# 2. Bulk canonicalize with multiprocessing (use your URL normalizer)
# 3. Upsert nodes (node_id = sipHash64(canonical_url), computed client-side or in SQL)
# 4. Resolve URLs to ids and insert into edges_id_staging
client.execute('INSERT INTO edges_id_staging (from_id, to_id, crawl_date) VALUES', id_rows)

# 5. Aggregate into edges_agg periodically
client.execute(
    'INSERT INTO edges_agg (from_id, to_id, out_count, last_updated) '
    'SELECT from_id, to_id, count() AS out_count, now() '
    'FROM edges_id_staging GROUP BY from_id, to_id'
)

Materialized views: make aggregation near-real-time

ClickHouse’s materialized views let you fold raw inserts into aggregated datasets without extra nightly jobs. Create a materialized view that consumes the raw or id-staging tables and writes into edges_agg. Use a SummingMergeTree target so ClickHouse collapses duplicate (from_id, to_id) rows and sums their counts at merge time.

CREATE MATERIALIZED VIEW mv_edges_agg TO edges_agg AS
SELECT
  from_id,
  to_id,
  count() AS out_count,
  now() AS last_updated
FROM edges_id_staging
GROUP BY from_id, to_id;

Computing PageRank-ish scores in ClickHouse

The classic power-iteration algorithm is a natural fit: repeatedly multiply the adjacency matrix by the rank vector. In ClickHouse this becomes a series of JOINs and GROUP BYs. Two optimizations are critical:

  • Precompute out-degrees so each iteration boils down to a couple of cheap joins plus a single GROUP BY.
  • Run domain-level first to get coarse signals fast and then refine at the URL level for high-value subsets.
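
For reference, the update rule approximated on each pass is the standard damped power iteration, with d the damping factor and N the number of nodes:

score_next(i) = (1 - d) / N + d * sum over links j -> i of score(j) / out_degree(j)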

Precompute degrees

-- Compute out-degrees into a small side table (edges_agg itself has no out_degree column):
CREATE TABLE out_degree
ENGINE = MergeTree
ORDER BY from_id AS
SELECT from_id, uniqExact(to_id) AS out_degree  -- distinct outgoing links per source node
FROM edges_agg
GROUP BY from_id;

One iteration in SQL

Assume pr_curr holds the previous iteration's scores as (node_id, score), edges_agg holds deduped (from_id, to_id) pairs, and out_degree holds (from_id, out_degree) from the step above.

INSERT INTO pr_scores (node_id, score, iteration, updated_at)
SELECT
  e.to_id AS node_id,
  damping * sum(p.score / d.out_degree) + teleport AS score,
  {next_iter} AS iteration,
  now() AS updated_at
FROM edges_agg AS e
JOIN pr_curr AS p ON e.from_id = p.node_id
JOIN out_degree AS d ON e.from_id = d.from_id
GROUP BY e.to_id;

Here damping is the usual damping factor (commonly 0.85) and teleport = (1 - damping) / N, where N is the number of nodes, matching the update rule above. Define the constants in your client or as a WITH expression, and replace {next_iter} with the iteration index.
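
A minimal sketch with the constants inlined as a WITH expression; the 0.85 damping factor and the iteration index 2 are placeholders, N is counted from the nodes table, and the exact syntax may vary slightly across ClickHouse versions:

INSERT INTO pr_scores (node_id, score, iteration, updated_at)
WITH
  0.85 AS damping,
  (SELECT count() FROM nodes) AS n_nodes
SELECT
  e.to_id AS node_id,
  damping * sum(p.score / d.out_degree) + (1 - damping) / n_nodes AS score,
  2 AS iteration,   -- next iteration index (placeholder)
  now() AS updated_at
FROM edges_agg AS e
JOIN pr_curr AS p ON e.from_id = p.node_id
JOIN out_degree AS d ON e.from_id = d.from_id
GROUP BY e.to_id;
-- Note: dangling-node mass is ignored here, which is usually acceptable for PageRank-ish ranking.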

Practical tips for iteration scale

  • Broadcast small tables: keep pr_curr memory-resident on each ClickHouse server (for example, as a Join engine table) to avoid a large shuffle; see the sketch after this list.
  • Run a limited number of iterations (10–20) for approximate convergence; for domain-level graphs fewer iterations suffice.
  • Use sampling for quick estimates: run iterations on a sampled subgraph (1% of nodes) to sanity-check changes before a full run.
  • Persist intermediate iterations so you can resume after failures. Partitioning pr_scores by iteration makes rollbacks simple.
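
One way to implement the broadcast tip is a Join engine table holding the previous iteration's scores, read with joinGet() inside the iteration query. A sketch; pr_curr_join is a name chosen here, not part of the schema above:

-- Memory-resident copy of the previous iteration's scores
CREATE TABLE pr_curr_join
(
  node_id UInt64,
  score Float64
) ENGINE = Join(ANY, LEFT, node_id);

INSERT INTO pr_curr_join
SELECT node_id, score FROM pr_scores WHERE iteration = 1;  -- previous iteration (placeholder)

-- Inside the iteration query, look scores up without a distributed shuffle:
--   joinGet('pr_curr_join', 'score', e.from_id)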

Domain authority approximation: collapse and compute

Domain-level metrics are often what SEOs want: fewer nodes, faster converging iterations, and stronger signal aggregation. Build domain edges from node->domain mapping and then run the same iterative process.

-- domain_edges: reduced graph
INSERT INTO domain_edges
SELECT
  d_from.domain_id AS from_domain,
  d_to.domain_id AS to_domain,
  count() AS weight
FROM edges_agg AS e
JOIN domains AS d_from ON e.from_id = d_from.node_id
JOIN domains AS d_to   ON e.to_id   = d_to.node_id
WHERE d_from.domain_id != d_to.domain_id
GROUP BY from_domain, to_domain;
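
The domain_edges target is not defined above; a minimal DDL sketch mirroring edges_agg, with the summed weight column named explicitly:

CREATE TABLE domain_edges (
  from_domain UInt32,
  to_domain UInt32,
  weight UInt64
) ENGINE = SummingMergeTree(weight)
ORDER BY (from_domain, to_domain);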

Then run the iterative algorithm on domain_edges; the result is a fast, robust domain authority approximation useful for ranking signals and regression models.

Incrementality: how to update scores daily without recomputing everything

For production SEO dashboards you want near-daily freshness without heavy compute. Strategies:

  • Delta ingestion: track changed or added edges since the last run and only recompute the affected neighborhoods of the graph. Streaming or near-real-time approaches (see monitoring and real-time patterns) help here.
  • Warm start: load the last saved scores as pr_curr and run fewer iterations with the new deltas; scores converge quickly if the deltas are small (see the sketch after this list).
  • Hybrid scheduling: run lightweight daily domain-level updates and heavier weekly URL-level recomputations for top N domains/pages.
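
A sketch of the warm start, assuming pr_curr is a plain MergeTree table keyed on node_id:

-- Reload pr_curr from the latest persisted iteration
TRUNCATE TABLE pr_curr;

INSERT INTO pr_curr (node_id, score)
SELECT node_id, score
FROM pr_scores
WHERE iteration = (SELECT max(iteration) FROM pr_scores);

-- For a cold start instead, initialize uniformly:
--   INSERT INTO pr_curr SELECT node_id, 1.0 / (SELECT count() FROM nodes) FROM nodes;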

Performance tuning and cost controls

  • Partitioning: shard by node hash or time buckets for raw captures; ensure edges_agg partitions avoid hotspotting.
  • Compression: store node ids as UInt64 with column codecs; ZSTD with a tuned level (and Delta on the sorted id columns) can substantially reduce storage of repetitive edge data.
  • Materialized views: offload aggregation work to ClickHouse to avoid nightly heavy ETL pipelines.
  • Sampling & early stopping: for trend detection you don’t need exact PageRank; stop iterating once the L2 difference between consecutive score vectors falls below a threshold (see the sketch after this list). For low-latency spot checks, pairing this with lightweight edge deployments can help.
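
A sketch of the convergence check, comparing two persisted iterations (the iteration numbers are placeholders):

-- L2 difference between consecutive iterations; stop when it drops below your threshold
SELECT sqrt(sum(pow(curr.score - prev.score, 2))) AS l2_diff
FROM (SELECT node_id, score FROM pr_scores WHERE iteration = 12) AS curr
JOIN (SELECT node_id, score FROM pr_scores WHERE iteration = 11) AS prev
  ON curr.node_id = prev.node_id;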

When not to use OLAP: trade-offs

OLAP shines for analytics and iterative algorithms that are mostly read-heavy. But if you need single-hop graph traversals with low latency per-request (e.g., per-user recommendations or deep traversal queries), a graph-database cache or in-memory graph may still be necessary. OLAP is best used as the canonical analytics store and model trainer. For sub-second per-request needs consider serverless or microservice options discussed in the serverless face-off.

Real-world scale notes and a short case study

From experience, moving from URL-level graphs of 20B edges to a host-level graph of ~200M edges reduced iteration time from many hours to under 30 minutes for 10 iterations on a moderate ClickHouse cluster. Teams that prioritized domain-level signals saw near-instant improvements in crawl prioritization and link-discovery pipelines.

"We cut our weekly PageRank job from 12 hours to 40 minutes by materializing edges into SummingMergeTree and collapsing to hosts first." — Senior Search Engineer (anonymous)

Looking ahead

  • OLAP + graph hybrid features: expect OLAP engines to offer more graph primitives (faster distributed joins, built-in PageRank-style functions, and UDFs) as demand grows.
  • Edge-aware ML features: combining link graph signals directly with embeddings and ranking models will become standard for SEO platforms.
  • Streaming graph deltas: low-latency materialized pipelines will let teams react to link spikes in near real time (useful for spam detection and outreach monitoring). See patterns for near-real-time monitoring.

ClickHouse’s continued product momentum since the large 2025 funding round means more managed options and integrations in 2026 — lowering the operational cost of running big link-graph analytics. For managed and cloud design patterns see cloud-native architectures discussions.

A quick implementation checklist:

  1. Model raw_edges for append-only ingest with partition TTL.
  2. Create nodes table and normalize canonical URLs in ETL.
  3. Aggregate to edges_agg via materialized views and SummingMergeTree.
  4. Precompute out-degrees and use them in iteration SQL.
  5. Start with domain-level PageRank to reduce surface area.
  6. Implement delta ingestion and warm-start iterative runs.
  7. Monitor convergence and store iteration snapshots for reproducibility.

Actionable takeaways

  • Don’t store raw edges only — raw captures are cheap but slow to query. Fold into aggregated edges early.
  • Use SummingMergeTree or aggregated materialized views to dedupe edges without separate batch jobs.
  • Compute domain authority first — it’s fast, stable, and actionable for SEO teams.
  • Implement incremental iterations by applying crawler deltas to a warm-started pr_scores table.

Next steps (call-to-action)

Ready to try this in your environment? Start by piping a week of raw edges into a ClickHouse test cluster and materialize an edges_agg table. If you want a reference implementation, clone our sample ETL and ClickHouse DDL repository (contains URL-normalizer examples, materialized view templates, and iterative SQL). For consultancy or cluster architecture reviews tuned to multi-billion-edge graphs, reach out — we help teams reduce PageRank runs from hours to minutes.

Want the repo link and a short checklist PDF? Contact us or subscribe for a hands-on walkthrough where we tune partitions, codecs, and iteration performance on your dataset.
