CLI Toolkit: Quick Commands to Diagnose Crawlability Issues from Logs

A compact CLI cheat-sheet (grep, jq, ClickHouse) for engineers and admins to triage crawlability and indexation failures fast.

Sites not getting indexed? If search visibility is slipping, the fastest way to triage is your command line and logs. This cheat-sheet gives engineers and IT admins compact, repeatable CLI commands (grep, jq) and scalable ClickHouse SQL samples you can run now to diagnose crawlability and indexation failures.

Why this CLI toolkit matters in 2026

Search engines, stricter bot management, and larger websites changed the game in 2024–2026. ClickHouse adoption for log analytics surged after major investments in late 2025, making OLAP queries practical for tens of billions of rows. Meanwhile, CDNs, WAFs, and modern bot throttling mean a single server log slice no longer tells the whole story. That’s why you need a hybrid approach:

  • Quick triage via shell tools (grep, awk, sed, jq) for single-host diagnostics.
  • Aggregated queries in ClickHouse for cross-host, cross-time analysis.
  • Small automations for CI/CD runbooks to catch regressions before deploys.

Quick triage checklist (inverted pyramid)

  1. Check robots.txt and sitemap discovery.
  2. Confirm search bot fetches and status codes (200 vs 4xx/5xx/429).
  3. Detect intentional blocks (X-Robots-Tag, meta robots, canonical conflicts).
  4. Correlate spike timing with deploys, rate-limits, or WAF rules.
  5. Scale analysis in ClickHouse for long windows and aggregated patterns.

Environment assumptions

Examples assume the standard Nginx combined access log format, JSON structured logs for the jq examples, and a ClickHouse table named access_logs with the columns noted in the SQL snippets. Adapt field names to your schema.
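
For reference, Nginx's built-in combined format and a representative line look like this (the IP, path, and timestamp are purely illustrative); the awk field positions used below assume this layout:

# log_format combined (predefined by Nginx):
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
#
# Sample line:
# 66.249.66.1 - - [11/Feb/2026:10:15:32 +0000] "GET /product/123 HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"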

Section A — Fast CLI commands for host-level logs

Use these for immediate answers on a host. Replace access.log with your log path and adjust date filters as needed.

1) Find Googlebot and other major crawlers

# Grep lines with Googlebot, Bingbot, and other named crawlers
grep -iE "googlebot|bingbot|baiduspider|yandex|duckduckgo" /var/log/nginx/access.log | tail -n 200

# Get counts by user-agent
grep -iE "googlebot|bingbot|baiduspider|yandex|duckduckgo" /var/log/nginx/access.log \
  | awk -F '"' '{print $6}' \
  | sort | uniq -c | sort -rn | head

Why

Confirm bots are hitting your site and get quick counts. If expected bots are absent, the issue may be upstream (DNS, CDN block, or Search Console warnings).
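
If bot traffic is present but looks suspicious, you can also confirm an IP really belongs to Googlebot with the documented reverse-DNS check. A minimal sketch, assuming the host utility (bind-utils/dnsutils) is installed; the IP below is only an example:

# Reverse-resolve the IP, then forward-resolve the returned name and confirm it maps back to the same IP
IP="66.249.66.1"                                  # replace with an IP taken from your logs
PTR=$(host "$IP" | awk '/pointer/ {print $NF}')   # should end in googlebot.com. or google.com.
host "$PTR" | grep -q "$IP" && echo "verified Googlebot" || echo "NOT verified"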

2) Status-code breakdown for crawlers

# Nginx combined logs: with -F '"', $2 is the request line and $3 holds " status bytes "
# Adjust field handling if your log format differs

grep -i "googlebot" /var/log/nginx/access.log \
  | awk -F '"' '{split($2,a," "); path=a[2]; split($3,b," "); status=b[1]; print status "\t" path}' \
  | sort | uniq -c | sort -rn | head -n 50

# Quick percent of 4xx/5xx
TOTAL=$(grep -ic "googlebot" /var/log/nginx/access.log)
ERRORS=$(grep -i "googlebot" /var/log/nginx/access.log | awk '$9 ~ /^[45]/' | wc -l)
awk -v t="$TOTAL" -v e="$ERRORS" 'BEGIN{printf "Total: %d, Errors: %d, Error%%: %.2f%%\n", t, e, (t ? e/t*100 : 0)}'

3) Detect 429 / rate limiting

# Find 429 responses for bots (rate-limited)

grep -i "googlebot" /var/log/nginx/access.log | awk '{print $1"\t"$4"\t"$9"\t"$7}' | egrep "\t429\t"

# Count unique bot IPs receiving 429
grep -i "googlebot" /var/log/nginx/access.log | awk '$9 == 429 {print $1}' | sort | uniq -c | sort -rn | head

4) Check robots.txt fetches and sitemap hits

# Who fetched robots.txt and with what status?

grep "GET /robots.txt" /var/log/nginx/access.log | awk -F '"' '{print $1"\t"$2"\t"$3}' | tail -n 50

# Sitemap fetches (sitemap.xml and sitemap index)
grep -i "sitemap.xml" /var/log/nginx/access.log | awk '{print $1"\t"$4"\t"$9"\t"$7}' | tail -n 50

5) Inspect response headers for X-Robots-Tag and caching

If you store response headers in logs or a JSON field, jq is your tool. Example below assumes JSON lines with fields: time, remote_ip, request, status, req_headers, res_headers.

# Show responses that included an X-Robots-Tag with 'noindex'

jq 'select(.res_headers != null)
    | select(.res_headers | test("x-robots-tag"; "i"))
    | select(.res_headers | test("noindex"; "i"))
    | {time, request, status, headers: .res_headers}' /var/log/nginx/access.jsonl

Section B — JSON logs with jq (structured logging)

Structured logs (JSON) are gold. Use jq for targeted, machine-friendly queries.
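
The examples below assume one JSON object per line with at least time, host, request, status, and user_agent fields, roughly like this illustrative line (not a required schema):

{"time":"2026-02-11T10:15:32Z","host":"example.com","remote_ip":"66.249.66.1","request":"GET /product/123 HTTP/1.1","status":200,"user_agent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}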

1) Count unique URLs crawled by Googlebot

jq -r 'select(.user_agent|test("googlebot";"i")) | .request' access.jsonl | cut -d' ' -f2 | sort | uniq -c | sort -rn | head

2) Find recent 5xx responses from crawlers

jq -c 'select(.status >= 500) | select(.user_agent | test("googlebot|bingbot"; "i")) | {time, status, request, host}' access.jsonl | head -n 200

3) Aggregate by hour to spot rate-limit windows

jq -r 'select(.user_agent|test("googlebot";"i")) | .time[0:13] as $h | {hour:$h, status:.status} ' access.jsonl \
  | jq -s 'group_by(.hour) | map({hour: .[0].hour, total: length, errors: (map(select(.status>=400)) | length)})'

Section C — Scalable triage using ClickHouse (OLAP)

ClickHouse enables fast, ad-hoc aggregation across massive log volumes. Below are sample SQL snippets for recent ClickHouse versions; adapt column names to your schema. They assume a table access_logs with columns: event_time DateTime, host String, path String, status UInt16, user_agent String, method String, remote_ip String, req_headers String, res_headers String.
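
If you prefer to stay on the command line, each snippet can be run through clickhouse-client. A minimal sketch (the host name is a placeholder; add credentials as your cluster requires):

# Run an ad-hoc query from the shell
clickhouse-client --host clickhouse.internal --query "
  SELECT count() FROM access_logs
  WHERE event_time >= now() - INTERVAL 1 DAY
    AND lower(user_agent) LIKE '%googlebot%'"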

1) Top crawled paths by Googlebot, last 7 days

SELECT path, count(*) AS hits
FROM access_logs
WHERE event_time >= now() - INTERVAL 7 DAY
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY path
ORDER BY hits DESC
LIMIT 100

2) Status distribution for crawlers (by day)

SELECT toDate(event_time) AS day,
       status,
       count() AS cnt
FROM access_logs
WHERE event_time >= now() - INTERVAL 30 DAY
  AND (lower(user_agent) LIKE '%googlebot%' OR lower(user_agent) LIKE '%bingbot%')
GROUP BY day, status
ORDER BY day DESC, cnt DESC

3) Detect mass 429 events (rate-limiting spikes)

SELECT toStartOfMinute(event_time) AS minute,
       count() AS total,
       countIf(status = 429) AS hits_429
FROM access_logs
WHERE event_time >= now() - INTERVAL 48 HOUR
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY minute
HAVING hits_429 > 0
ORDER BY minute ASC

4) Which IPs are getting blocked or throttled most?

SELECT remote_ip, countIf(status = 429) AS throttled, countIf(status >= 500) AS server_errors, count() AS total
FROM access_logs
WHERE event_time >= now() - INTERVAL 7 DAY
  AND lower(user_agent) LIKE '%googlebot%'
GROUP BY remote_ip
ORDER BY throttled DESC, server_errors DESC
LIMIT 50

5) Find responses with X-Robots-Tag: noindex in headers

SELECT path, count() AS cnt
FROM access_logs
WHERE event_time >= now() - INTERVAL 30 DAY
  AND match(res_headers, '(?i)X-?Robots-Tag:.*noindex')
GROUP BY path
ORDER BY cnt DESC
LIMIT 100

6) Example DDL (create table)

CREATE TABLE access_logs
(
  event_time DateTime,
  host String,
  path String,
  method String,
  status UInt16,
  remote_ip String,
  user_agent String,
  req_headers String,
  res_headers String,
  bytes UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_time)
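
To get data into that table, one common pattern is to stream structured log lines straight in with the JSONEachRow format. A minimal sketch, assuming your JSON field names already match the column names above:

# Bulk-load JSON access logs; JSONEachRow maps JSON keys to column names
clickhouse-client --query "INSERT INTO access_logs FORMAT JSONEachRow" < access.jsonl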

Practical triage playbook (step-by-step)

Run these steps in order when indexation drops or Search Console shows crawl errors.

  1. robots.txt & sitemap
    • CLI: curl -sS https://example.com/robots.txt | sed -n '1,120p'
    • Ensure Sitemap: lines exist and each referenced URL is reachable. If a sitemap points to a 404, bots won’t discover pages reliably (a quick check is sketched after this list).
  2. Search bot presence
    • Host logs: check for Googlebot/Bingbot in last 48 hours (grep examples above).
    • If absent, check DNS, firewall rules, CDN/WAF blocks, and Search Console settings.
  3. Response codes & throttles
    • Use the ClickHouse 48-hour 429 spike query to detect throttling windows.
    • Correlate with deploy times and WAF rule updates.
  4. Intentional exclusions
    • Search for X-Robots-Tag: noindex and <meta name="robots" content="noindex"> across responses (jq & ClickHouse examples).
  5. Canonical and redirect cascades
    • Spot redirect loops by finding many 3xx sequences for canonical URLs in logs.
  6. Automate checks in CI
    • Run a compact script (sample below) as part of deploy pipelines to fail when robots.txt blocks or X-Robots-Tag is set to noindex on production pages.
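
For step 1, a quick sketch that confirms every Sitemap: URL listed in robots.txt actually responds with 200 (example.com is a placeholder):

# Extract Sitemap: lines from robots.txt and print the HTTP status of each referenced URL
curl -sS https://example.com/robots.txt | tr -d '\r' \
  | awk -F': ' 'tolower($1)=="sitemap" {print $2}' \
  | while read -r sm; do
      printf '%s  %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$sm")" "$sm"
    done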

Sample CI check (bash)

#!/usr/bin/env bash
# ci-crawl-check.sh
set -e
URLS=("https://example.com/" "https://example.com/product/123")
for url in "${URLS[@]}"; do
  if curl -sSL -I "$url" | grep -qi "X-Robots-Tag:.*noindex"; then
    echo "ERROR: $url has X-Robots-Tag: noindex"; exit 1
  fi
  if curl -sSL "$url" | grep -Eqi '<meta[^>]+name=.?robots.?[^>]*noindex'; then
    echo "ERROR: $url contains meta robots noindex"; exit 1
  fi
done
echo "CI crawl checks passed"

Trends and recommendations

  • Move more logs to OLAP: ClickHouse and other OLAP systems have become the default at scale. Centralize logs (edge, CDN, origin) to avoid blind spots.
  • Index discovery matters: Sitemaps and sitemap indexes will remain critical for very large sites—automate verification that sitemaps are reachable and referenced in robots.txt.
  • Bot fingerprinting and anti-bot changes: Modern bot management may modify headers. Verify crawler identity (e.g., reverse-DNS checks) and cross-check behavior with Search Console reports.
  • CI integration: Run lightweight checks before and after deploys. Prevent regressions that accidentally expose noindex or block crawlers. Consider adding a Docker-based test harness for reproducible validation.
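
For the Docker-based harness, a minimal sketch using the official ClickHouse image (default image tag and ports; adjust to your registry and CI runner):

# Throwaway ClickHouse for validating the SQL snippets against a sample of logs
docker run --rm -d --name ch-test -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server
# ...create access_logs, load a log sample (see the JSONEachRow example above), run queries...
docker stop ch-test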

Actionable takeaways

  • Start with quick host-level grep and jq checks to unblock immediate issues.
  • Scale analysis with ClickHouse to find correlated patterns across hosts and time windows.
  • Automate these checks in CI to catch indexation regressions before they reach production.
  • Centralize logs (edge, CDN, origin) so your triage works across modern architectures.

“When in doubt, search the logs: they rarely lie. Combine fast shell checks with OLAP queries to move from suspicion to root cause in minutes, not days.”

Where to go next

Use this cheat-sheet as a baseline. Customize the patterns for your log format and add specific checks for your CDN and WAF headers. If you run ClickHouse, dump one week of logs into a test_access_logs table and run the SQL snippets above to validate assumptions before moving to production queries.

Call to action

If you want a downloadable, CI-ready bundle of these commands (with a Docker-based test harness for ClickHouse), get the free CLI toolkit from our repo and a ready-to-run GitHub Actions workflow for pre-deploy crawl checks. Implement automated triage now and sharply cut the mean time to detect crawlability issues.
