How We Scale Radar

Discover how we’ve pushed the boundaries of fraud detection by scaling Radar, leveraging bleeding-edge technologies to stay ahead of the curve.

Scaling Pandabase to New Heights

Origins and Challenges

By early 2022, our Node.js-based system hit significant scalability limits:

  1. P99 latency exceeded 300ms during peak times, reducing our capacity to process transactions.
  2. CPU utilization spiked unpredictably, making it difficult to optimize resources.
  3. Memory usage grew linearly with transaction volume, leading to potential scaling limits.
  4. Feature computation created a bottleneck, delaying downstream processing.

We needed a new system that could handle a 10x increase in volume, maintain latency under 100ms, and work at a massive scale. This journey led us to a distributed system built around high-performance languages, specialized databases, and rigorous design.


Choosing Rust for Core Processing

The most critical part of Radar is the transaction processing engine, which makes real-time fraud detection decisions. For this, we chose Rust for its high performance, memory safety, and concurrency model. Rust’s async capabilities, strong type system, and zero-cost abstractions enabled us to build a highly optimized core.

Core Transaction Processor

Our Rust-based processor does far more than relay a simple API call; it analyzes thousands of data points, from user history to behavioral patterns, within milliseconds. The TransactionProcessor architecture has three primary components:

pub struct TransactionProcessor {
    feature_store: Arc<FeatureStore>,
    risk_engine: Arc<RiskEngine>,
    metrics: Arc<Metrics>,
}

Each component is meticulously optimized for concurrency and latency:

  1. Feature Store: Extracts essential data from various sources to compute risk features.
  2. Risk Engine: Evaluates risk with a probabilistic model over the computed features.
  3. Metrics: Captures key metrics, such as processing time, to help us understand system performance.

The evaluate method processes each transaction as follows:

impl TransactionProcessor {
    pub async fn evaluate(&self, tx: Transaction) -> Result<Decision> {
        let start = Instant::now();

        // Feature computation (parallelized inside the store), bounded to a 50ms budget
        let features = self.feature_store
            .compute_features(&tx)
            .with_timeout(Duration::from_millis(50))
            .await?;

        // Risk evaluation, bounded to a 30ms budget
        let decision = self.risk_engine
            .evaluate(features)
            .with_timeout(Duration::from_millis(30))
            .await?;

        self.metrics.record_latency(start.elapsed());
        Ok(decision)
    }
}

Key Improvements:

  • Memory Efficiency: Rust’s ownership model prevents memory leaks and avoids garbage collection pauses, which are common in high-throughput systems.
  • Predictable Latency: Async execution allows us to run many operations concurrently without blocking the runtime’s worker threads.
  • CPU Utilization: Rust’s low-level capabilities make it easy to fine-tune CPU usage, resulting in stable and efficient resource usage.

Impact:

  • Reduced P99 latency from 300ms to 95ms.
  • A single Rust service replaced five Node.js instances, cutting resource costs and complexity.

Service Infrastructure with Go

Rust powers the engine, while Go handles the supporting infrastructure, acting as Radar’s gateway. Go’s goroutine model and networking capabilities make it ideal for building a responsive and robust service layer.

Gateway Design

Our gateway orchestrates each transaction through various subsystems. Here’s a simplified example of the gateway in Go:

type Gateway struct {
    limiter    *rate.Limiter
    processor  *processor.Client
    monitor    *metrics.Monitor
}

func (g *Gateway) ProcessTransaction(ctx context.Context, tx *pb.Transaction) (*pb.Decision, error) {
    start := time.Now()

    if !g.limiter.Allow() {
        return g.fallbackDecision(ctx, tx)
    }

    decision, err := g.processor.Process(ctx, tx)
    g.monitor.RecordLatency(ctx, time.Since(start))
    return decision, err
}

The gateway performs several essential functions:

  1. Rate Limiting: Prevents overload by throttling requests.
  2. Circuit Breaking: Detects failures and prevents cascading errors across the system.
  3. Metrics Collection: Tracks key performance metrics for analytics and alerting.
  4. Fallback Strategies: Ensures continuity, even during failures, by implementing default responses.

Advantages:

  • Network Efficiency: Go’s efficient networking model handles high traffic with minimal latency.
  • Easy Deployment: Go’s static binaries and minimal dependencies make deployment straightforward.

Data Architecture: Five Databases with Specific Roles

A major architectural choice was the use of five different databases. Each serves a unique purpose in maintaining both performance and accuracy at scale.

1. PostgreSQL: The System of Record

PostgreSQL is the primary transaction store, handling structured and relational data. Our schema design leverages partitioning and indexing to ensure high performance:

CREATE TABLE transactions (
    id uuid PRIMARY KEY,
    user_id uuid NOT NULL,
    amount numeric NOT NULL,
    risk_score float NOT NULL,
    created_at timestamptz DEFAULT now()
) PARTITION BY RANGE (created_at);

Key Features:

  • Partitioned Storage: Records are partitioned by creation time, which keeps queries on recent data fast and makes retention cleanup as cheap as dropping old partitions.
  • Optimized Indexing: Carefully crafted indexes speed up common queries.
  • Regular Maintenance: Scheduled VACUUM and ANALYZE operations prevent table bloat and keep the planner’s statistics fresh.
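
To make the read-path benefit concrete, here is a minimal sketch of a recent-activity lookup, assuming sqlx as the Postgres client; the RecentTx type and recent_transactions helper are hypothetical and only mirror the schema above. The time-range predicate lets PostgreSQL prune all but the newest partitions.

use sqlx::PgPool;
use uuid::Uuid;

// Hypothetical row type for a recent-activity lookup; columns mirror the schema above.
#[derive(sqlx::FromRow)]
struct RecentTx {
    id: Uuid,
    amount: f64,
    risk_score: f64,
}

async fn recent_transactions(pool: &PgPool, user_id: Uuid) -> Result<Vec<RecentTx>, sqlx::Error> {
    // The created_at predicate lets PostgreSQL skip old partitions entirely.
    sqlx::query_as::<_, RecentTx>(
        "SELECT id, amount::float8 AS amount, risk_score
         FROM transactions
         WHERE user_id = $1
           AND created_at > now() - interval '24 hours'
         ORDER BY created_at DESC
         LIMIT 100",
    )
    .bind(user_id)
    .fetch_all(pool)
    .await
}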

2. KeyDB: Real-time Feature Computation

KeyDB, a high-performance Redis alternative, handles real-time feature calculations. Our choice of KeyDB over Redis was driven by KeyDB’s multi-threading capabilities, which are crucial for high-throughput and low-latency demands.

Example Feature Computation:

pub async fn compute_velocity(&self, user_id: &str) -> Result<VelocityFeatures> {
    // self.conn is assumed to be a multiplexed connection, which is cheap to clone.
    let mut conn = self.conn.clone();
    let mut pipe = redis::pipe();

    // Bump the counter for the 5-, 15-, and 60-minute windows and refresh each TTL.
    for window in &[5, 15, 60] {
        let key = format!("v:{}:{}", user_id, window);
        pipe.incr(&key, 1)
            .expire(&key, window * 60).ignore();
    }

    // Only the INCR results come back; the EXPIRE replies are ignored above.
    let results: Vec<i64> = pipe.query_async(&mut conn).await?;
    Ok(VelocityFeatures::from_results(results))
}

Why KeyDB:

  • Multi-threading: Handles parallel operations more efficiently than Redis.
  • Memory Efficiency: Optimized for memory usage under high concurrency.
  • Redis Compatibility: KeyDB’s compatibility with Redis clients allowed for easy integration.

3. ClickHouse: Real-time Analytics Engine

ClickHouse powers Radar’s real-time analytics. With efficient storage and compression, ClickHouse manages billions of records and delivers sub-second query times for analysis.

Features:

  • Data Compression: Stores large volumes of data compactly.
  • Real-time Aggregations: Enables instant insights from current and historical data.
  • Low-latency Queries: Achieves fast read performance for analytics and reporting.
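
As an illustration of the kind of real-time aggregation this enables, here is a minimal sketch assuming the clickhouse Rust crate; the RiskByRegion row type, the query, and the presence of a region column in the ClickHouse table are assumptions for the example, not Radar’s actual schema.

use clickhouse::{Client, Row};
use serde::Deserialize;

// Hypothetical result row for a per-region risk summary.
#[derive(Row, Deserialize)]
struct RiskByRegion {
    region: String,
    txns: u64,
    avg_risk: f64,
}

async fn risk_by_region(client: &Client) -> Result<Vec<RiskByRegion>, clickhouse::error::Error> {
    // Aggregate the last hour of traffic; ClickHouse scans compressed columns,
    // which keeps this fast even over very large tables.
    client
        .query(
            "SELECT region, count() AS txns, avg(risk_score) AS avg_risk \
             FROM transactions \
             WHERE created_at > now() - INTERVAL 1 HOUR \
             GROUP BY region \
             ORDER BY avg_risk DESC",
        )
        .fetch_all::<RiskByRegion>()
        .await
}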

4. CouchDB: Pattern Storage

We use CouchDB for storing fraud patterns and machine learning models. CouchDB’s document model and support for multi-master replication make it ideal for pattern storage.

Benefits:

  • Document Flexibility: Allows dynamic schema evolution as patterns change.
  • Real-time Replication: Propagates pattern updates across all instances in near real time.
  • Complex Matching: Supports the pattern-matching algorithms essential for fraud detection.
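
For a sense of what lives in this store, here is a hypothetical sketch of a pattern document modeled as a serde-serializable Rust struct; the field names are illustrative only, not Radar’s actual schema.

use serde::{Deserialize, Serialize};

// Hypothetical shape of a fraud-pattern document; CouchDB's _id/_rev metadata
// map naturally onto struct fields.
#[derive(Serialize, Deserialize)]
pub struct FraudPattern {
    #[serde(rename = "_id")]
    pub id: String,
    #[serde(rename = "_rev", default, skip_serializing_if = "Option::is_none")]
    pub rev: Option<String>,
    pub name: String,
    pub version: u32,
    pub conditions: Vec<Condition>,
}

// A single rule condition evaluated against computed features.
#[derive(Serialize, Deserialize)]
pub struct Condition {
    pub feature: String, // e.g. "velocity_5m"
    pub op: String,      // e.g. "gt", "lt", "eq"
    pub threshold: f64,
}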

5. Redis: Distributed State Management

Redis plays a supporting role, managing Radar’s distributed state for critical functions such as session management, feature coordination, and event aggregation.

Advantages:

  • Session Storage: Maintains user state and context across distributed transactions.
  • Distributed Locking: Coordinates actions across multiple servers, preventing race conditions.
  • Event Aggregation: Consolidates events from multiple sources, allowing Radar to react to complex scenarios in real-time.
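
The distributed-locking piece boils down to Redis’s atomic SET NX PX pattern. Below is a minimal sketch assuming the redis crate’s async API; the key naming, TTL, and try_acquire_lock helper are illustrative rather than Radar’s actual implementation, and a production lock also needs safe release and renewal.

use redis::aio::MultiplexedConnection;

// Try to take a short-lived lock on `resource`. Returns true if we own it.
pub async fn try_acquire_lock(
    conn: &mut MultiplexedConnection,
    resource: &str,
    token: &str,
    ttl_ms: u64,
) -> redis::RedisResult<bool> {
    // SET key value NX PX ttl is atomic: it only succeeds if the key is absent.
    let reply: Option<String> = redis::cmd("SET")
        .arg(format!("lock:{resource}"))
        .arg(token)
        .arg("NX")
        .arg("PX")
        .arg(ttl_ms)
        .query_async(conn)
        .await?;
    Ok(reply.is_some()) // Some("OK") when acquired, None when another holder owns it
}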

Operational Learnings from Production

Database Synchronization Across Five Systems

Synchronizing data across five databases posed a real challenge. We built a synchronization service in Rust that writes to the system of record first and then fans out to the other stores, keeping them consistent.

pub struct DataSync {
    postgres: Arc<PostgresStore>,
    clickhouse: Arc<ClickHouseStore>,
    redis: Arc<RedisClient>,
}

impl DataSync {
    pub async fn sync_transaction(&self, tx: Transaction) -> Result<()> {
        // PostgreSQL is the system of record, so it is written first.
        let _tx_id = self.postgres.store(tx.clone()).await?;

        // Fan out to the analytics and feature stores in parallel (tokio::try_join!).
        try_join!(
            self.clickhouse.store(tx.clone()),
            self.redis.update_features(tx)
        )?;

        Ok(())
    }
}

Error Handling Strategy

We designed custom error types in Rust (using thiserror) to capture, propagate, and respond to failures cleanly:

use thiserror::Error;

#[derive(Error, Debug)]
pub enum ProcessorError {
    #[error("feature computation timed out after {0}ms")]
    FeatureTimeout(u64),

    #[error("database error: {0}")]
    DatabaseError(#[from] DbError),

    #[error("risk engine error: {0}")]
    RiskEngineError(#[from] EngineError),
}
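
One payoff of modeling errors this way is that callers can degrade gracefully instead of failing the payment outright. The sketch below is hypothetical, and Decision::manual_review() is an illustrative constructor rather than Radar’s real API: a timeout or downstream failure becomes a conservative decision instead of an error.

pub fn decision_or_fallback(result: Result<Decision, ProcessorError>) -> Decision {
    match result {
        Ok(decision) => decision,
        // A timed-out feature computation is not a reason to block a payment.
        Err(ProcessorError::FeatureTimeout(_)) => Decision::manual_review(), // illustrative constructor
        // Database or risk-engine failures also degrade to a conservative decision.
        Err(_) => Decision::manual_review(), // illustrative constructor
    }
}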

Advanced Monitoring and Observability

To keep Radar reliable and efficient, we designed a comprehensive monitoring strategy that captures system health, performance, and bottlenecks in real time. Observability tools include Prometheus for time-series metrics, Grafana for visualization, and Jaeger for tracing.

Here’s a look at how we record metrics for each transaction:

pub async fn record_metrics(&self, tx: &Transaction, result: &Decision, duration: Duration) {
    self.prometheus
        .histogram_vec("transaction_duration_ms")
        .with_label_values(&[
            &tx.region,
            &result.to_string()
        ])
        .observe(duration.as_millis() as f64);
}
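
Metrics cover the what; Jaeger traces cover the where. Here is a minimal sketch of how a span could be attached to an evaluation, assuming the tracing crate with an OpenTelemetry/Jaeger exporter installed elsewhere; the evaluate_traced wrapper is illustrative, and the subscriber setup is not shown.

use tracing::{info_span, Instrument};

// Wrap the evaluation future in a span so it shows up as one unit of work in Jaeger.
pub async fn evaluate_traced(
    processor: &TransactionProcessor,
    tx: Transaction,
) -> Result<Decision> {
    let span = info_span!("evaluate_transaction", region = %tx.region);
    processor.evaluate(tx).instrument(span).await
}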

Metrics Tracked:

  • Transaction Latency: Monitors the time taken to evaluate each transaction.
  • Error Rates: Tracks frequency and types of errors for quick detection and response.
  • Database Performance: Monitors latency and throughput for each database to identify potential slowdowns or bottlenecks.
  • Service Health: Uses alerts and dashboards for proactive monitoring of CPU, memory, and network performance across the service.

Results After One Year in Production

After deploying the redesigned Radar system and running it in production for a year, we achieved significant improvements:

  • Uptime: 99.99% uptime across all services, ensuring reliable fraud detection.
  • P99 Latency: Maintained sub-100ms latency, a critical factor for real-time fraud prevention.
  • Transaction Volume: Scaled to process over 50 million transactions daily without significant performance degradation.
  • Fraud Reduction: Achieved a 43% decrease in fraud through better accuracy and faster detection.
  • Cost Efficiency: Reduced operating costs by 67% through optimized resource allocation and efficient infrastructure.

Future Developments

While the system is robust, we continue to innovate and add features to stay ahead of emerging fraud tactics. Our roadmap includes:

  1. Enhanced Graph-based Fraud Detection: Leverage graph databases to detect complex fraud patterns involving networks of accounts.
  2. Real-time Model Updates: Enable on-the-fly updates to machine learning models without redeployments, keeping fraud detection models current.
  3. Improved Pattern Matching: Enhance our CouchDB pattern storage with faster matching algorithms.
  4. Advanced Device Fingerprinting: Improve user profiling and detection accuracy by adding device-level analysis.
  5. Extended Behavioral Analysis: Use behavioral data to build detailed profiles that detect anomalies in user behavior patterns.

Key Takeaways

  1. Use the Right Tool for Each Job: Select languages and databases based on strengths, not popularity.
  2. Optimize for Observability: Build monitoring and metrics into the design from the beginning to ensure proactive responses.
  3. Balance Complexity and Value: Adopt complex, polyglot architectures only when the value justifies it.
  4. Prioritize Developer Experience: Optimize workflows to make development and maintenance easier for teams.
  5. Design for Scale and Flexibility: Building a scalable and adaptable system allows for easier enhancements and long-term viability.

Radar exemplifies a scalable, resilient, and purpose-built system. Each component is crafted for high-throughput, low-latency fraud detection, enabling us to meet today’s demands while preparing for tomorrow’s challenges.
