Building Scalable Blockchain Data Pipelines: Lessons from Processing 50M+ Daily RPC Events Across 30+ Networks
Building infrastructure that can reliably process data from 30+ different blockchain networks isn’t just about handling volume; it’s about managing the complexity of different protocols, consensus mechanisms, and data structures while maintaining real-time performance. Over the past year at Lava Network, I’ve architected and built a comprehensive data pipeline that processes millions of blockchain events daily. Here’s what I’ve learned.
The Challenge: Multi-Chain Complexity at Scale
When you’re building for a single blockchain like Ethereum, you can optimize for its specific characteristics. But when you need to support everything from Cosmos chains to Ethereum L2s to completely different architectures like Solana or Near, the challenge multiplies exponentially.
Key Challenges We Faced:
- Heterogeneous Data Formats: Each blockchain has its own event structure, transaction format, and block architecture
- Varying Block Times: From Ethereum’s ~12 seconds to Cosmos chains with 6-7 second blocks to high-frequency chains
- Different RPC Interfaces: Each chain has its own API patterns and data access methods
- Scalability Requirements: Processing millions of events while maintaining low latency for real-time analytics
- Reliability: No data loss across network disruptions, chain reorganizations, and system upgrades
Architecture Overview: The 4-Component Pipeline
After extensive prototyping and testing, we settled on a 4-component architecture that balances performance, reliability, and maintainability:
+---------------+    +----------------+    +-------------------+    +-------------+
|  Ingest API   |--->| Parser Worker  |--->|  Metrics Loader   |--->|  Stats API  |
|   (Gateway)   |    |  (Processor)   |    |   (Aggregator)    |    | (Frontend)  |
+---------------+    +----------------+    +-------------------+    +-------------+
        |                     |                      |                     |
        v                     v                      v                     v
   Chain RPCs           Kafka Topics            TimescaleDB           Redis Cache
Component 1: Ingest API - The Universal Gateway
The ingest API serves as our universal gateway to blockchain data. Instead of building 30+ different data collectors, we created a unified interface that can adapt to any blockchain’s RPC pattern.
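Conceptually, each chain plugs in behind a small adapter interface. The TypeScript sketch below is illustrative rather than our actual code: the ChainAdapter and EvmAdapter names and the RawBlock shape are made up, though eth_blockNumber and eth_getBlockByNumber are standard EVM JSON-RPC methods.

// Hypothetical sketch of a per-chain adapter behind one ingest interface.
interface RawBlock {
  chainId: string;
  height: number;
  timestamp: number;   // unix seconds
  payload: unknown;    // chain-specific body, parsed downstream
}

interface ChainAdapter {
  readonly chainId: string;
  getLatestHeight(): Promise<number>;
  getBlock(height: number): Promise<RawBlock>;
}

// Example adapter for an EVM-style JSON-RPC endpoint.
class EvmAdapter implements ChainAdapter {
  constructor(readonly chainId: string, private rpcUrl: string) {}

  private async rpc<T>(method: string, params: unknown[]): Promise<T> {
    const res = await fetch(this.rpcUrl, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
    });
    const { result } = await res.json();
    return result as T;
  }

  async getLatestHeight(): Promise<number> {
    return parseInt(await this.rpc<string>("eth_blockNumber", []), 16);
  }

  async getBlock(height: number): Promise<RawBlock> {
    const block = await this.rpc<any>("eth_getBlockByNumber", ["0x" + height.toString(16), true]);
    return {
      chainId: this.chainId,
      height,
      timestamp: parseInt(block.timestamp, 16),
      payload: block,
    };
  }
}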
Component 2: Parser Worker - The Data Transformer
The parser worker is where the magic happens. It receives raw blockchain data via Kafka and transforms it into our unified data model while preserving chain-specific metadata.
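The core of that step is a batch transform with per-event error isolation. Here is a minimal sketch, assuming one raw event per Kafka message; the UnifiedEvent shape and helper names are illustrative, not our production schema.

// Hypothetical unified event model; field names are illustrative.
interface UnifiedEvent {
  chainId: string;
  blockHeight: number;
  timestamp: number;
  kind: string;                       // e.g. "transaction", "rpc_call"
  metadata: Record<string, unknown>;  // chain-specific fields preserved here
}

// Parse a batch of raw messages; a failure on one event must not drop the rest.
function parseBatch(
  rawMessages: { chainId: string; value: string }[],
  parseOne: (chainId: string, raw: unknown) => UnifiedEvent,
): { events: UnifiedEvent[]; failures: number } {
  const events: UnifiedEvent[] = [];
  let failures = 0;
  for (const msg of rawMessages) {
    try {
      events.push(parseOne(msg.chainId, JSON.parse(msg.value)));
    } catch (err) {
      // Error isolation: log and count, keep processing the rest of the batch.
      failures++;
      console.error(`parse failed for chain ${msg.chainId}:`, err);
    }
  }
  return { events, failures };
}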
Design Principles:
- Failsafe Architecture: Exponential backoff and retry mechanisms at each processing stage
- Queue-Based Communication: Dedicated message queues between all pod-to-service communications
- Error Isolation: Failed parsing of one event doesn’t affect others
- Batch Processing: Optimized for throughput while maintaining order guarantees
Component 3: Metrics Loader - The Aggregation Engine
TimescaleDB was chosen for its excellent time-series capabilities and PostgreSQL compatibility. The metrics loader performs real-time aggregations and builds the data structures that power our analytics.
Aggregation Strategies:
- Time-Based Windows: 1-minute, 5-minute, hourly, and daily aggregations
- Chain-Specific Metrics: RPC latency, block times, transaction throughput
- Cross-Chain Analytics: Comparative performance metrics and trend analysis
- Continuous Aggregates: Pre-computed rollups for fast query performance
Sample Aggregation:
CREATE MATERIALIZED VIEW hourly_chain_metrics
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', timestamp) AS hour,
chain_id,
count(*) as total_transactions,
avg(gas_used) as avg_gas_used,
avg(block_time) as avg_block_time,
count(DISTINCT block_height) as blocks_processed
FROM blockchain_events
GROUP BY hour, chain_id;
Component 4: Stats API - The Performance Layer
The final component serves data to our various UIs and external consumers. Built with performance in mind, it combines TimescaleDB for historical data with Redis for real-time caching.
Performance Optimizations:
- Multi-Layer Caching: L1 (in-memory), L2 (Redis), L3 (TimescaleDB); see the sketch after this list
- Query Optimization: Pre-computed aggregates and intelligent query planning
- Connection Pooling: Efficient database connection management
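Roughly, the read path looks like the sketch below: a process-local map in front of Redis, with TimescaleDB as the final fallback. The getStats name, TTL values, and client callbacks are assumptions for illustration only.

// Hypothetical three-layer read path: L1 in-memory, L2 Redis, L3 TimescaleDB.
type Fetcher = (key: string) => Promise<string | null>;

const l1 = new Map<string, { value: string; expiresAt: number }>();
const L1_TTL_MS = 5_000; // illustrative TTL, not a production value

async function getStats(
  key: string,
  redisGet: Fetcher,                                            // e.g. (k) => redis.get(k)
  redisSet: (k: string, v: string, ttlSec: number) => Promise<unknown>,
  queryDb: Fetcher,                                             // falls through to TimescaleDB
): Promise<string | null> {
  // L1: process-local cache
  const hit = l1.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  // L2: Redis
  const cached = await redisGet(key);
  if (cached !== null) {
    l1.set(key, { value: cached, expiresAt: Date.now() + L1_TTL_MS });
    return cached;
  }

  // L3: TimescaleDB, then backfill both cache layers
  const fresh = await queryDb(key);
  if (fresh !== null) {
    await redisSet(key, fresh, 60);
    l1.set(key, { value: fresh, expiresAt: Date.now() + L1_TTL_MS });
  }
  return fresh;
}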
Technology Deep Dive
Why Bun Runtime?
We chose Bun as our JavaScript runtime for several compelling reasons:
- Performance: 3-4x faster startup times compared to Node.js
- Built-in Tools: Native bundling, testing, and package management
- Modern Standards: First-class TypeScript support without compilation overhead
- Memory Efficiency: Lower memory footprint for our worker processes
Performance Comparison:
# Cold start times (average of 100 runs)
Node.js: 2.3s
Bun: 0.6s
# Memory usage (steady state)
Node.js: 185MB per worker
Bun: 127MB per worker
Kafka for Event Streaming
Kafka provides the backbone for our event streaming with several key configurations:
# Optimized for throughput and reliability
batch.size: 1048576      # 1MB batches
linger.ms: 100           # Wait up to 100ms to fill batches before sending
compression.type: snappy
acks: all                # Ensure durability
retries: 2147483647      # Effectively infinite retries (max int32)
Topic Strategy:
- raw-events-{chainId}: Raw blockchain data per chain
- processed-events: Unified processed events
- metrics-updates: Real-time metric updates
- alerts: System health and anomaly notifications
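On the producing side, the ingest API writes each raw block to its per-chain topic. A hedged kafkajs sketch, keyed by height so per-chain ordering is preserved within a partition (the broker address, client id, and helper name are placeholders):

import { Kafka } from "kafkajs";

// Hypothetical producer sketch: publish raw blocks to the per-chain topic.
const kafka = new Kafka({ clientId: "ingest-api", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishRawBlock(chainId: string, block: object, height: number) {
  await producer.send({
    topic: `raw-events-${chainId}`,
    acks: -1, // equivalent to acks: all, matching the durability setting above
    messages: [{ key: String(height), value: JSON.stringify(block) }],
  });
}

// Usage (chain id and block are placeholders):
// await producer.connect();
// await publishRawBlock("ethereum-mainnet", rawBlock, blockHeight);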
TimescaleDB Optimizations
Partitioning Strategy:
-- Partition by time and chain for optimal query performance
SELECT create_hypertable(
'blockchain_events',
'timestamp',
partitioning_column => 'chain_id',
number_partitions => 32
);
-- Optimize for time-range queries
CREATE INDEX ON blockchain_events (timestamp DESC, chain_id);
CREATE INDEX ON blockchain_events (chain_id, timestamp DESC);
Compression Policies:
-- Enable compression (segmented by chain), then compress data older than 7 days
ALTER TABLE blockchain_events SET (timescaledb.compress, timescaledb.compress_segmentby = 'chain_id');
SELECT add_compression_policy('blockchain_events', INTERVAL '7 days');
Performance Metrics & Results
After 6 months in production, our pipeline consistently delivers impressive performance:
Throughput Metrics
- Peak Processing Rate: 50,000 events/second
- Average Latency: End-to-end processing under 2 seconds
- Storage Efficiency: 80% compression ratio on historical data
- Query Performance: 95th percentile queries under 200ms
Reliability Metrics
- Uptime: 99.9% availability
- Data Accuracy: 100% - zero data loss incidents
- Recovery Time: Under 60 seconds for component failures
- Scalability: Linear scaling up to 100 worker processes
Resource Utilization
Component        CPU Usage    Memory Usage    Storage
Ingest API       15-25%       2GB             -
Parser Workers   40-60%       8GB             -
Metrics Loader   20-35%       4GB             -
Stats API        10-20%       1GB             -
TimescaleDB      30-50%       16GB            2TB
Redis            5-15%        4GB             -
DragonflyDB      3-8%         2GB             -
Kafka            20-30%       6GB             500GB
Note: We use Redis as the primary cache, with DragonflyDB as a high-performance replica for read-heavy workloads, giving higher throughput and lower latency for analytics queries.
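Because DragonflyDB speaks the Redis protocol, routing can use an ordinary Redis client. A rough sketch with ioredis, where the hostnames and the cacheGet/cachePut helpers are placeholders rather than our production code:

import Redis from "ioredis";

// Writes go to the Redis primary; read-heavy analytics traffic goes to the
// DragonflyDB replica. Hostnames are placeholders.
const primary = new Redis({ host: "redis-primary", port: 6379 });
const replica = new Redis({ host: "dragonfly-replica", port: 6379 });

async function cachePut(key: string, value: string, ttlSec: number) {
  await primary.set(key, value, "EX", ttlSec);
}

async function cacheGet(key: string): Promise<string | null> {
  // Fall back to the primary if the replica is unavailable.
  try {
    return await replica.get(key);
  } catch {
    return primary.get(key);
  }
}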
Advanced Technical Implementation
Dual-Pool Database Architecture
The system separates database traffic into dedicated read and write connection pools so heavy analytical reads never starve write throughput:
// Read Pool - Optimized for query performance
readPool: {
max: 20, // Maximum connections
min: 5, // Minimum connections
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 10000,
maxUses: 7500, // Connection recycling
query_application_timeout: 120000 // 2 minutes
}
// Write Pool - Optimized for write throughput
writePool: {
max: 10, // Fewer connections for writes
min: 2,
maxUses: 7500,
query_application_timeout: 120000
}
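Wiring that split with node-postgres can look roughly like this; the connection strings, the query() wrapper, and the subset of pool options shown are assumptions for illustration:

import { Pool } from "pg";

// Two pools sized like the config above; connection strings are placeholders.
const readPool = new Pool({
  connectionString: process.env.READ_DATABASE_URL,
  max: 20,
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 10_000,
});

const writePool = new Pool({
  connectionString: process.env.WRITE_DATABASE_URL,
  max: 10,
});

// Route plain SELECTs to the read pool, everything else to the write pool.
function query(text: string, params: unknown[] = []) {
  const pool = /^\s*select/i.test(text) ? readPool : writePool;
  return pool.query(text, params);
}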
Latency Distribution System
We implemented uncapped latency tracking with a 7-bucket classification system for internal analytics:
Latency Buckets:
- 0-100ms: Fast responses
- 100-200ms: Good responses
- 200-300ms: Acceptable responses
- 300-400ms: Slow responses
- 400-500ms: Very slow responses
- 500-1000ms: Outlier responses
- 1000ms+: Critical outliers
Daily Aggregation: Each chain/day record stores 14 additional columns (a count and a sum per bucket), enabling precise latency distribution analysis internally while the public API stays backward compatible with its 500ms-capped values.
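The classification itself is a simple boundary lookup. A small sketch, assuming latencies arrive in milliseconds; the boundaries follow the list above, while the accumulator shape and names are illustrative:

// Bucket boundaries in ms, matching the 7-bucket scheme above.
const BUCKET_BOUNDS = [100, 200, 300, 400, 500, 1000] as const;

// Returns a bucket index 0..6; latency is uncapped, so anything >= 1000ms lands in bucket 6.
function latencyBucket(latencyMs: number): number {
  for (let i = 0; i < BUCKET_BOUNDS.length; i++) {
    if (latencyMs < BUCKET_BOUNDS[i]) return i;
  }
  return BUCKET_BOUNDS.length; // 1000ms+ critical outliers
}

// Per chain/day accumulator: one count and one sum per bucket (14 columns total).
interface DailyLatencyRow {
  counts: number[]; // length 7
  sums: number[];   // length 7
}

function recordLatency(row: DailyLatencyRow, latencyMs: number) {
  const b = latencyBucket(latencyMs);
  row.counts[b] += 1;
  row.sums[b] += latencyMs;
}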
Kafka Optimization Strategy
Batch Size Hierarchy:
CloudFlare Logs (70,000+ entries)
└─> SendEntries splits (20,000 entries/call)
└─> BatchManager buffers (15,000 default, max 20,000)
└─> Kafka chunks (5,000 entries)
└─> Final safety batches (2,000 entries)
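Each level of that hierarchy is the same splitting operation applied with a different size. A generic sketch (the chunk helper and the sizes in the usage comment are illustrative):

// Split a large batch into fixed-size chunks before handing it to the next stage.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// e.g. a 70,000-entry CloudFlare log export -> 20,000-entry calls -> 5,000-entry Kafka sends
// const kafkaChunks = chunk(entries, 20_000).flatMap((call) => chunk(call, 5_000));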
Producer Configuration:
- Message max: 1GB (1,073,741,824 bytes)
- Batch size: 10MB (10,485,760 bytes)
- Linger: 10ms for optimal batching
- Socket buffer: 1GB for high throughput
Consumer Optimization:
- 10 partitions for parallel processing
- 100MB max fetch per partition
- 20 parallel workers per consumer
- Circuit breaker protection with exponential backoff
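The retry half of that protection can be sketched as exponential backoff around a generic async operation; the attempt count and base delay below are illustrative, and the full circuit-breaker state machine is omitted:

// Retry an async operation with exponential backoff; after maxAttempts the
// error propagates, which is where breaker logic would open the circuit.
async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt; // 200ms, 400ms, 800ms, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}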
PostgreSQL Performance Tuning
Aggressive Memory Configuration:
shared_buffers: 8GB # 2x increase for better caching
maintenance_work_mem: 4GB # 4x increase for faster aggregations
work_mem: 256MB # 4x increase for complex queries
wal_buffers: 256MB # 4x increase for write performance
Parallel Processing:
- 16 max worker processes
- 8 parallel workers per query
- 8 parallel maintenance workers
- 300 effective I/O concurrency
Query Optimization:
- Statistics target: 500 (5x default)
- Random page cost: 1.0 (SSD optimized)
- CPU costs reduced for modern hardware
- Genetic query optimization for complex joins
Circular Buffer Management
Real-time Data Storage:
// Fixed-size buffers with FIFO eviction
buffers: Map<string, CircularBuffer<LogEntryData>>
geoBuffers: Map<string, CircularBuffer<GeoPoint>>
Buffer configuration:
- Size: 5,000 entries per buffer
- Memory: ~2MB per buffer
- TTL: 1 hour automatic expiration
- Thread-safe operations
Buffer Types:
- Global entries (all chains)
- Chain-specific entries (per blockchain)
- Provider-specific entries (per RPC provider)
- Geographic point aggregation (lat/lon data)
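A minimal fixed-size buffer with FIFO eviction might look like the sketch below; the class name mirrors the configuration above, but the implementation details are illustrative:

// Fixed-capacity ring buffer: once full, the oldest entry is overwritten (FIFO eviction).
class CircularBuffer<T> {
  private items: (T | undefined)[];
  private head = 0;   // index of the next write
  private count = 0;

  constructor(private capacity: number) {
    this.items = new Array(capacity);
  }

  push(item: T): void {
    this.items[this.head] = item;
    this.head = (this.head + 1) % this.capacity;
    if (this.count < this.capacity) this.count++;
  }

  // Returns entries oldest-first.
  toArray(): T[] {
    const out: T[] = [];
    for (let i = 0; i < this.count; i++) {
      const idx = (this.head - this.count + i + this.capacity * 2) % this.capacity;
      out.push(this.items[idx] as T);
    }
    return out;
  }
}

// Usage: per-chain buffer of the most recent 5,000 log entries.
// const ethBuffer = new CircularBuffer<LogEntryData>(5_000);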
Conclusion
Building scalable blockchain data infrastructure is a complex challenge that requires careful consideration of performance, reliability, and cost. The key is to design for the inherent complexity and unpredictability of the blockchain ecosystem while maintaining the flexibility to adapt to rapid protocol evolution.
Our architecture at Lava Network has proven that it’s possible to process massive volumes of multi-chain data in real-time while maintaining high reliability and performance. The combination of modern technologies like Bun, Kafka, and TimescaleDB, coupled with thoughtful system design, enables us to provide valuable insights across the entire blockchain ecosystem.
Whether you’re building analytics platforms, DeFi protocols, or blockchain explorers, these patterns and principles can help you create robust, scalable infrastructure that can handle the demands of modern multi-chain applications.