CLI Toolkit: Quick Commands to Diagnose Crawlability Issues from Logs

A compact CLI cheat-sheet (grep, jq, ClickHouse) for engineers and admins to triage crawlability and indexation failures fast.

Sites not getting indexed? If search visibility is slipping, the fastest way to triage is your command line and logs. This cheat-sheet gives engineers and IT admins compact, repeatable CLI commands (grep, jq) and scalable ClickHouse SQL samples you can run now to diagnose crawlability and indexation failures.

Why this CLI toolkit matters in 2026

Search engines, stricter bot management, and larger websites changed the game in 2024–2026. ClickHouse adoption for log analytics surged after major investments in late 2025, making OLAP queries practical for tens of billions of rows. Meanwhile, CDNs, WAFs, and modern bot throttling mean a single server log slice no longer tells the whole story. That’s why you need a hybrid approach:

  • Quick triage via shell tools (grep, awk, sed, jq) for single-host diagnostics.
  • Aggregated queries in ClickHouse for cross-host, cross-time analysis.
  • Small automations for CI/CD runbooks to catch regressions before deploys.

Quick triage checklist (inverted pyramid)

  1. Check robots.txt and sitemap discovery.
  2. Confirm search bot fetches and status codes (200 vs 4xx/5xx/429).
  3. Detect intentional blocks (X-Robots-Tag, meta robots, canonical conflicts).
  4. Correlate spike timing with deploys, rate-limits, or WAF rules.
  5. Scale analysis in ClickHouse for long windows and aggregated patterns.

Environment assumptions

Examples assume the standard Nginx combined access log format, JSON structured logs for the jq examples, and a ClickHouse table named access_logs with the columns noted in the SQL snippets. Adapt field names to your schema.
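
For reference, Nginx's built-in combined format and a representative line look like this (the IP, path, and timestamp are purely illustrative); the awk field positions used below assume this layout:

# log_format combined (predefined by Nginx):
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
#
# Sample line:
# 66.249.66.1 - - [11/Feb/2026:10:15:32 +0000] "GET /product/123 HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"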

Section A — Fast CLI commands for host-level logs

Use these for immediate answers on a host. Replace access.log with your log path and adjust date filters as needed.

1) Find Googlebot and other major crawlers

# Grep lines with Googlebot, Bingbot, and other named crawlers
grep -iE "googlebot|bingbot|baiduspider|yandex|duckduckgo" /var/log/nginx/access.log | tail -n 200

# Get counts by user-agent
grep -iE "googlebot|bingbot|baiduspider|yandex|duckduckgo" /var/log/nginx/access.log \
  | awk -F '"' '{print $6}' \
  | sort | uniq -c | sort -rn | head

Why

Confirm bots are hitting your site and get quick counts. If expected bots are absent, the issue may be upstream (DNS, CDN block, or Search Console warnings).
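
If bot traffic is present but looks suspicious, you can also confirm an IP really belongs to Googlebot with the documented reverse-DNS check. A minimal sketch, assuming the host utility (bind-utils/dnsutils) is installed; the IP below is only an example:

# Reverse-resolve the IP, then forward-resolve the returned name and confirm it maps back to the same IP
IP="66.249.66.1"                                  # replace with an IP taken from your logs
PTR=$(host "$IP" | awk '/pointer/ {print $NF}')   # should end in googlebot.com. or google.com.
host "$PTR" | grep -q "$IP" && echo "verified Googlebot" || echo "NOT verified"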

2) Status-code breakdown for crawlers

# Nginx combined logs: with -F '"', $2 is the request line and $3 holds " status bytes "
# Adjust field handling if your log format differs

grep -i "googlebot" /var/log/nginx/access.log \
  | awk -F '"' '{split($2,a," "); path=a[2]; split($3,b," "); status=b[1]; print status "\t" path}' \
  | sort | uniq -c | sort -rn | head -n 50

# Quick percent of 4xx/5xx
TOTAL=$(grep -ic "googlebot" /var/log/nginx/access.log)
ERRORS=$(grep -i "googlebot" /var/log/nginx/access.log | awk '$9 ~ /^[45]/' | wc -l)
awk -v t="$TOTAL" -v e="$ERRORS" 'BEGIN{printf "Total: %d, Errors: %d, Error%%: %.2f%%\n", t, e, (t ? e/t*100 : 0)}'

3) Detect 429 / rate limiting

# Find 429 responses for bots (rate-limited)

grep -i "googlebot" /var/log/nginx/access.log | awk '{print $1"\t"$4"\t"$9"\t"$7}' | egrep "\t429\t"

# Count unique bot IPs receiving 429
grep -i "googlebot" /var/log/nginx/access.log | awk '$9 == 429 {print $1}' | sort | uniq -c | sort -rn | head

4) Check robots.txt fetches and sitemap hits

# Who fetched robots.txt and with what status?

grep "GET /robots.txt" /var/log/nginx/access.log | awk -F '"' '{print $1"\t"$2"\t"$3}' | tail -n 50

# Sitemap fetches (sitemap.xml and sitemap index)
grep -i "sitemap.xml" /var/log/nginx/access.log | awk '{print $1"\t"$4"\t"$9"\t"$7}' | tail -n 50

5) Inspect response headers for X-Robots-Tag and caching

If you store response headers in logs or a JSON field, jq is your tool. Example below assumes JSON lines with fields: time, remote_ip, request, status, req_headers, res_headers.

# Show responses that included an X-Robots-Tag with 'noindex'

jq 'select(.res_headers != null)
    | select(.res_headers | test("x-robots-tag"; "i"))
    | select(.res_headers | test("noindex"; "i"))
    | {time, request, status, headers: .res_headers}' /var/log/nginx/access.jsonl

Section B — JSON logs with jq (structured logging)

Structured logs (JSON) are gold. Use jq for targeted, machine-friendly queries.
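
The examples below assume one JSON object per line with at least time, host, request, status, and user_agent fields, roughly like this illustrative line (not a required schema):

{"time":"2026-02-11T10:15:32Z","host":"example.com","remote_ip":"66.249.66.1","request":"GET /product/123 HTTP/1.1","status":200,"user_agent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}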

1) Count unique URLs crawled by Googlebot

jq -r 'select(.user_agent|test("googlebot";"i")) | .request' access.jsonl | cut -d' ' -f2 | sort | uniq -c | sort -rn | head

2) Find recent 5xx responses from crawlers

jq -c 'select(.status >= 500) | select(.user_agent | test("googlebot|bingbot"; "i")) | {time, status, request, host}' access.jsonl | head -n 200

3) Aggregate by hour to spot rate-limit windows

jq -r 'select(.user_agent|test("googlebot";"i")) | .time[0:13] as $h | {hour:$h, status:.status} ' access.jsonl \
  | jq -s 'group_by(.hour) | map({hour: .[0].hour, total: length, errors: (map(select(.status>=400)) | length)})'

Section C — Scalable triage using ClickHouse (OLAP)

ClickHouse enables fast, ad-hoc aggregation across massive log volumes. Below are sample SQL snippets for recent ClickHouse versions; adapt column names to your schema. They assume a table access_logs with columns: event_time DateTime, host String, path String, status UInt16, user_agent String, method String, remote_ip String, req_headers String, res_headers String.
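
If you prefer to stay on the command line, each snippet can be run through clickhouse-client. A minimal sketch (the host name is a placeholder; add credentials as your cluster requires):

# Run an ad-hoc query from the shell
clickhouse-client --host clickhouse.internal --query "
  SELECT count() FROM access_logs
  WHERE event_time >= now() - INTERVAL 1 DAY
    AND lower(user_agent) LIKE '%googlebot%'"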

1) Top crawled paths by Googlebot, last 7 days

SELECT path, count(*) AS hits
FROM access_logs
WHERE event_time >= now() - INTERVAL 7 DAY
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY path
ORDER BY hits DESC
LIMIT 100

2) Status distribution for crawlers (by day)

SELECT toDate(event_time) AS day,
       status,
       count() AS cnt
FROM access_logs
WHERE event_time >= now() - INTERVAL 30 DAY
  AND (lower(user_agent) LIKE '%googlebot%' OR lower(user_agent) LIKE '%bingbot%')
GROUP BY day, status
ORDER BY day DESC, cnt DESC

3) Detect mass 429 events (rate-limiting spikes)

SELECT toStartOfMinute(event_time) AS minute,
       count() AS total,
       countIf(status = 429) AS hits_429
FROM access_logs
WHERE event_time >= now() - INTERVAL 48 HOUR
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY minute
HAVING hits_429 > 0
ORDER BY minute ASC

4) Which IPs are getting blocked or throttled most?

SELECT remote_ip, countIf(status = 429) AS throttled, countIf(status >= 500) AS server_errors, count() AS total
FROM access_logs
WHERE event_time >= now() - INTERVAL 7 DAY
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY remote_ip
ORDER BY throttled DESC, server_errors DESC
LIMIT 50

5) Find responses with X-Robots-Tag: noindex in headers

SELECT path, count() AS cnt
FROM access_logs
WHERE event_time >= now() - INTERVAL 30 DAY
  AND match(res_headers, '(?i)X-?Robots-Tag:.*noindex')
GROUP BY path
ORDER BY cnt DESC
LIMIT 100

6) Example DDL (create table)

CREATE TABLE access_logs
(
  event_time DateTime,
  host String,
  path String,
  method String,
  status UInt16,
  remote_ip String,
  user_agent String,
  req_headers String,
  res_headers String,
  bytes UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time)
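
To get data into that table, one common pattern is to stream structured log lines straight in with the JSONEachRow format. A minimal sketch, assuming your JSON field names already match the column names above:

# Bulk-load JSON access logs; JSONEachRow maps JSON keys to column names
clickhouse-client --query "INSERT INTO access_logs FORMAT JSONEachRow" < access.jsonl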

Practical triage playbook (step-by-step)

Run these steps in order when indexation drops or Search Console shows crawl errors.

  1. robots.txt & sitemap
    • CLI: curl -sS https://example.com/robots.txt | sed -n '1,120p'
    • Ensure Sitemap: lines exist and each referenced URL is reachable. If a sitemap points to a 404, bots won’t discover pages reliably (a quick check is sketched after this list).
  2. Search bot presence
    • Host logs: check for Googlebot/Bingbot in last 48 hours (grep examples above).
    • If absent, check DNS, firewall rules, CDN/WAF blocks, and Search Console settings.
  3. Response codes & throttles
    • Use the ClickHouse 48-hour 429 spike query to detect throttling windows.
    • Correlate with deploy times and WAF rule updates.
  4. Intentional exclusions
    • Search for X-Robots-Tag: noindex and <meta name="robots" content="noindex"> across responses (jq & ClickHouse examples).
  5. Canonical and redirect cascades
    • Spot redirect loops by finding many 3xx sequences for canonical URLs in logs.
  6. Automate checks in CI
    • Run a compact script (sample below) as part of deploy pipelines to fail when robots.txt blocks or X-Robots-Tag is set to noindex on production pages.
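
For step 1, a quick sketch that confirms every Sitemap: URL listed in robots.txt actually responds with 200 (example.com is a placeholder):

# Extract Sitemap: lines from robots.txt and print the HTTP status of each referenced URL
curl -sS https://example.com/robots.txt | tr -d '\r' \
  | awk -F': ' 'tolower($1)=="sitemap" {print $2}' \
  | while read -r sm; do
      printf '%s  %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$sm")" "$sm"
    done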

Sample CI check (bash)

#!/usr/bin/env bash
# ci-crawl-check.sh
set -e
URLS=("https://example.com/" "https://example.com/product/123")
for url in "${URLS[@]}"; do
  if curl -sSL -I "$url" | grep -qi "X-Robots-Tag:.*noindex"; then
    echo "ERROR: $url has X-Robots-Tag: noindex"; exit 1
  fi
  if curl -sSL "$url" | grep -Eqi '<meta[^>]+name=.?robots.?[^>]*noindex'; then
    echo "ERROR: $url contains meta robots noindex"; exit 1
  fi
done
echo "CI crawl checks passed"

Trends and recommendations

  • Move more logs to OLAP: ClickHouse and other OLAP systems have become the default at scale. Centralize logs (edge, CDN, origin) to avoid blind spots.
  • Index discovery matters: Sitemaps and sitemap indexes will remain critical for very large sites—automate verification that sitemaps are reachable and referenced in robots.txt.
  • Bot fingerprinting and anti-bot changes: Modern bot management may modify headers. Verify crawler identity (e.g., reverse-DNS checks) and cross-check behavior with Search Console reports.
  • CI integration: Run lightweight checks before and after deploys. Prevent regressions that accidentally expose noindex or block crawlers. Consider adding a Docker-based test harness for reproducible validation.
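
For the Docker-based harness, a minimal sketch using the official ClickHouse image (default image tag and ports; adjust to your registry and CI runner):

# Throwaway ClickHouse for validating the SQL snippets against a sample of logs
docker run --rm -d --name ch-test -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server
# ...create access_logs, load a log sample (see the JSONEachRow example above), run queries...
docker stop ch-test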

Actionable takeaways

  • Start with quick host-level grep and jq checks to unblock immediate issues.
  • Scale analysis with ClickHouse to find correlated patterns across hosts and time windows.
  • Automate these checks in CI to catch indexation regressions before they reach production.
  • Centralize logs (edge, CDN, origin) so your triage works across modern architectures.

“When in doubt, search the logs: they rarely lie. Combine fast shell checks with OLAP queries to move from suspicion to root cause in minutes, not days.”

Where to go next

Use this cheat-sheet as a baseline. Customize the patterns for your log format and add specific checks for your CDN and WAF headers. If you run ClickHouse, dump one week of logs into a test_access_logs table and run the SQL snippets above to validate assumptions before moving to production queries.

Call to action

If you want a downloadable, CI-ready bundle of these commands (with a Docker-based test harness for ClickHouse), get the free CLI toolkit from our repo and a ready-to-run GitHub Actions workflow for pre-deploy crawl checks. Implement automated triage now and sharply cut the mean time to detect crawlability issues.
