
Application monitoring & logging: A developer's guide to taking control

observability · DevOps · performance · automation
18 August 2025

No developer wants a 3 AM page about a server outage or performance alert, but that's exactly what happens without solid application monitoring. Poor performance and outages directly impact your revenue.

Akamai and SOASTA’s research reveals the brutal economics of performance problems: even a mere 100ms increase in latency slashes conversion rates by 7%, while extended downtime can hemorrhage thousands of dollars hourly. This direct link between technical performance and business outcomes is precisely why your monitoring stack must evolve beyond basic tracking to become an early warning system, catching and resolving issues before they ever reach your users' screens.

Application performance monitoring means stopping issues before they start. Well-configured service level objectives (SLOs) turn potential disasters into minor fixes.

"The best time to fix a bug is before it exists." - Andrew Hunt, co-author of The Pragmatic Programmer

In this guide, we'll explore these key areas:

  • Early warning systems that catch issues at the onset
  • Precise alerts that identify root causes
  • Data-driven maintenance planning
  • Battle-tested tools that fit your stack

Why monitor? The real cost of downtime

What hurts more than a 3 AM wake-up call? Watching revenue fall after systems go down. Here's how smart monitoring stops fires before they start:

  • See it break: Users won't tell you when things slow down - they'll ghost you
  • Fix faster: Precise system insights can significantly cut your resolution time
  • Scale intelligently: Monitor capacity metrics to scale resources based on actual demand
  • Data-driven decisions: Finally, measure the real impact of your technical choices

Proactive security: Don't wait for breaches to happen. Detect and block threats in real time with continuous monitoring.

 

The three core elements of application monitoring

Application monitoring keeps your systems honest. Three pieces work together to tell you what's happening under the hood. 

1. Metrics: Track key health indicators

  • Watch your system's health in real-time–catch small hiccups before they turn into real headaches
  • Monitor key indicators like CPU usage, response times and error rates
  • Set smart alerts that notify you when metrics drift outside acceptable ranges
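The alerting bullet above can be sketched in a few lines. This is an illustrative threshold check; the metric names and limits here are hypothetical, not taken from any particular tool:

```python
# Illustrative thresholds only; tune these for your own workloads.
THRESHOLDS = {
    "cpu_percent": {"warning": 70, "critical": 85},
    "error_rate_percent": {"warning": 1, "critical": 5},
    "response_time_ms": {"warning": 500, "critical": 1000},
}

def evaluate(metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    levels = THRESHOLDS[metric]
    if value >= levels["critical"]:
        return "critical"
    if value >= levels["warning"]:
        return "warning"
    return "ok"
```

Real systems layer durations on top of this (e.g. CPU must stay above 70% for five minutes before alerting) to avoid flapping.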

2. Logs: Your system's digital paper trail

  • Capture detailed records of every significant event across your application
  • Log every error, warning, and user action: it's your digital paper trail when things go wrong
  • Debug issues with complete context and timestamp data

3. Traces: Your request's journey map

  • Follow requests as they travel through your distributed services
  • Find performance bottlenecks and improve service connections
  • See how different parts of your app work together, from API calls to database queries and service communication

These three pieces work together as your monitoring foundation, helping you catch problems early and keep everything running smoothly. You know what's great? Modern cloud platforms have made it pretty easy to set up these monitoring tools–no rocket science needed. Let's look at how in the next section.

Monitoring in the cloud era

Cloud monitoring gives you a direct line between system health and business impact. Real-time monitoring lets you catch and fix small hiccups before they snowball into big problems that affect your users. You can optimize your system based on actual performance metrics.

Modern container platforms like Kubernetes have changed how we look at data, offering granular insights into every component of your application while maintaining the full context of your system.

Multi-runtime architecture: Optimizing monitoring across your stack

Modern applications use different specialized tools (called runtimes) to get various jobs done efficiently. Rust powers the heavy lifting, Node drives web APIs, and Python crunches data. Each runtime needs its own monitoring approach to catch problems early—think of it like having specialized doctors for different parts of your system.

Runtime | Requests/sec | Memory Pattern | Monitoring Focus | Common Use Case
Rust | 690 | Static allocation | Thread metrics, memory safety | High-performance services
Node.js | 100 | Event-driven | Event loop, async operations | API services, real-time apps
Python | 50 | Reference counting | GIL contention, memory usage | Data processing, ML services

Runtime-specific monitoring strategies:

This strategy helps maintain optimal performance across your entire application while reducing mean time to resolution when problems pop up.

  • Node.js monitoring
    • Track event loop lag (warning threshold: >100ms)
    • Monitor heap usage patterns with tools like clinic.js
    • Example config: { "eventLoopLag": { "warning": 100, "critical": 500 }, "heapUsage": { "warning": "70%", "critical": "85%" } }
       
  • Python monitoring
    • Watch for GIL contention using py-spy
    • Track memory leaks with memory_profiler
    • Example config: { "gilContentionRate": { "warning": "25%", "critical": "40%" }, "memoryGrowth": { "warning": "10MB/min", "critical": "50MB/min" } }
       
  • Rust monitoring
    • Monitor thread pool usage to maintain service responsiveness
    • Use metrics-rs to track system resources
    • Example config: { "threadPoolUtilization": { "warning": "85%", "critical": "95%" }, "requestLatency": { "warning": "10ms", "critical": "50ms" } }

Upsun makes working with multiple runtimes less complex by giving you visibility into your entire stack from a single place. You can define and manage dependencies between services, helping you identify potential bottlenecks. Our unified monitoring lets you track performance across all technologies.

Numbers that drive decisions

Here are the key metrics to watch, along with their thresholds and what to do when things go wrong:

1. Performance metrics (speed & responsiveness)

Metric | Normal | Warning | Critical
Response Time | < 200ms | > 500ms | > 1000ms
Error Rate | < 0.1% | > 1% | > 5%

Response time monitoring: Track request response speed—it's your user experience pulse.

  • When the warning threshold hits: Profile slow queries (using database profiling tools), optimize code paths, and check network latency (with traceroute, mtr).
  • When the critical threshold hits: Deploy APM diagnostics. Trace bottlenecks. Check timeouts. Monitor resources.
    Example log: [2025-02-13 12:15:23] [WARN] Slow query detected - Endpoint: /payment - Duration: 650ms - Query: SELECT * FROM orders WHERE user_id = ?

Error rate monitoring: Track requests that fail to keep your service running smoothly.

  • When the warning threshold hits: Analyze error patterns across your stack (using error tracking tools).
  • When the critical threshold hits: Check recent deploys and infrastructure changes.
    Example log: [2025-02-13 12:30:00] [ERROR] POST /api/users - 500 Internal Server Error - Reason: Database connection failed

 

2. System vitals

Metric | Normal | Warning | Critical
CPU Usage | < 70% | > 70% (5+ min) | > 85% (2+ min)
Memory Usage | < 75% | > 85% | > 95%

CPU monitoring: Find performance bottlenecks early and fix them before your users notice anything's wrong.

  • When the warning threshold hits: Profile CPU-heavy processes (using top, htop), optimize algorithms.
  • When the critical threshold hits: Scale compute capacity, optimize critical execution paths.
    Example log: [2025-02-13 14:20:33] [WARN] High CPU Usage - Service: API - Usage: 82%

Memory monitoring: Track memory usage to find leaks before they tank your app. No crashes, no surprises.

  • When the warning threshold hits: Run memory profilers to check heap dumps and garbage collection.
  • When the critical threshold hits: Hunt down memory leaks and restart if needed.
    Example log: [2025-02-13 14:45:10] [WARN] Memory Pressure - Service: BackgroundJobProcessor - Heap Usage: 90%

Disk I/O monitoring: Monitor both how fast your storage can handle operations (IOPS) and how much data it can move at once (throughput) to keep things running well.

  • Set baselines for your specific workloads (using iostat, iotop)
  • Alert on significant pattern changes
  • Monitor database and file operations closely and tune them when performance starts to lag

3. Service level objectives (SLOs)

Availability tracking: Define measurable SLOs that quantify your service performance. Track latency, error rates, and uptime.

  • Target: 99.95% uptime. Track the error budget (the allowable amount of downtime or errors before breaching service level agreements).
  • On breach: Trigger automated scaling when the error budget hits 90% to proactively manage capacity.
    Example log: [2025-02-13 15:00:00] [LOG] SLO breach detected - Error budget: 92% - Region: US-East - Status: Initiating capacity analysis
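As a quick sketch of the error-budget arithmetic above (the function name and numbers are illustrative): a 99.95% target over a 30-day month allows roughly 21.6 minutes of downtime, and budget consumption is downtime divided by that allowance:

```python
def error_budget(slo_target, period_minutes, downtime_minutes):
    """Return (allowed_downtime_minutes, fraction_of_budget_consumed)."""
    allowed = period_minutes * (1 - slo_target)  # e.g. 43200 * 0.0005 ≈ 21.6 min
    consumed = downtime_minutes / allowed
    return allowed, consumed

# 10.8 minutes of downtime burns half the monthly budget at 99.95%
allowed, consumed = error_budget(0.9995, 30 * 24 * 60, 10.8)
```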

Performance SLOs: Define response time targets that match what users expect and keep measuring against them.

  • Target: Keep p99 latency (99th percentile response time) under 500ms to ensure most users have a fast, smooth experience.
  • On breach: Scale up resources and optimize code paths to restore service levels
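To make the p99 target measurable, here is a minimal nearest-rank percentile sketch; it is illustrative and not tied to any monitoring library:

```python
import math

def p99(samples_ms):
    """99th-percentile latency using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)  # index of the p99 sample
    return ordered[rank]
```

In production you would compute this over a sliding window (or let your metrics backend do it) rather than over all samples at once.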

4. Real user metrics that matter

  • Page load time:
    • Normal: < 2s
    • Warning: > 3s
    • Critical: > 5s (Users leaving)
    • Action steps:

      • Profile frontend performance (using Lighthouse, WebPageTest)
      • Optimize assets and API calls
      • Use network testing tools to identify and resolve performance bottlenecks
      • [2025-02-13 16:10:22] [WARN] Slow Page Load - URL: /products - Average Load Time: 4.2s
  • User flow completion: Track successful user journeys.
    • Normal: > 95%
    • Warning: < 90%
    • Critical: < 85%
    • Action steps:

      • Parse funnel analytics
      • Scan error logs
      • Review session data for drop-off points
      • [2025-02-13 16:30:45] [WARN] Cart Abandonment Spike - User Journey: Checkout - Completion Rate: 88%

Track these user-focused metrics consistently and act swiftly when warning signs appear; you'll maintain solid service delivery and prevent user frustration. When you combine these insights with your system monitoring and logging, you get an observability picture covering both technical excellence and business success. With this foundation, let's explore how to protect your systems in real-time with robust security monitoring.

5. Protect your systems in real-time

Security monitoring works alongside performance tracking to shield your systems from threats. Here's your practical security toolkit:

  • Smart detection
    • Block suspicious IPs after 10 failed login attempts per minute
    • Alert when traffic spikes 3x above normal within 5 minutes

Scan security events every 30 seconds to spot attack patterns in real-time: your first defense against emerging threats.

Attack Type | Detection Method | Response
Credential abuse | Login velocity per IP | Progressive rate limiting
Data theft | Unusual data patterns | Network isolation

  • Active defense
    • Rate limiting that adapts to your traffic patterns
    • Instant blocking of malicious activity
    • Real-time threat data integration
  • Quick response playbook
    • Standardized threat handling with MITRE ATT&CK
    • Automated response sequences

Response time targets:

Threat type | Response time | Action required
Brute force | < 5 min | Block IPs, alert security
Data breach | < 15 min | Isolate affected systems
DDoS | < 10 min | Scale defenses, filter traffic
Zero-day | < 30 min | Patch and update systems
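The "block suspicious IPs after 10 failed login attempts per minute" rule above can be sketched with a sliding window. Names and thresholds here are illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_FAILURES = 10  # threshold from the detection rules above

_failures = defaultdict(deque)  # ip -> timestamps of recent failures

def record_failed_login(ip, now=None):
    """Record a failed login; return True if the IP should be blocked."""
    now = time.time() if now is None else now
    window = _failures[ip]
    window.append(now)
    # Drop failures older than the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) >= MAX_FAILURES
```

A production version would live at the edge (WAF, reverse proxy) and share state across instances, but the windowing logic is the same.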

6. Business metrics

Your application's technical metrics directly impact your bottom line. Here's how to translate performance data into business value:

Metric | Normal | Warning | Critical | Action
Transaction Success | ≥99% | 98-99% | <95% | Check payment systems, API health
Revenue Loss | $0 | $1-5K/hr | >$10K/hr | Activate incident response

Response matrix for transaction issues:

  • 98-99%: Monitor closely
    • Review system logs
    • Check recent deployments
  • 95-98%: Immediate action
    • Analyze payment gateway health
    • Review traffic patterns
  • <95%: Critical response
    • Activate the incident team
    • Execute rollback procedures

Key takeaway: Track these metrics in real-time and adjust thresholds based on your business patterns. Quick response to degradation prevents major revenue impact.

Logs: the complete story of your system

Every event, error, and user action gets captured in logs, giving you the complete story when things go wrong. They're your first line of defense for production issues.

Why structured logs matter

  • Security alerts: Catch suspicious patterns early
  • System clarity: See how services interact in real-time
  • Early detection: Fix issues before users report them

What this means for you:

  • Fast searches: Zero in on issues using request_id, user_id, error_type, or timestamp
  • Spot patterns: Catch recurring issues before they become problems
  • Automate responses: Set up alerts that trigger when things need attention
  • Clean data: Parse logs consistently across your entire stack

Pro logging practices

  • Pick your log levels wisely:
    • DEBUG: Dev-only details (keep out of prod)
    • INFO: Normal system events
    • WARN: Problems that need attention
    • ERROR: Recoverable failures
    • CRITICAL: Drop everything and fix now
  • Essential fields in every log:
    • request_id: Link events across services
    • timestamp: UTC for global tracking
    • user_id: Who triggered this
    • service_name: Where it happened
    • log_level: How urgent is it
  • Keep it clean: Use one format everywhere
  • Store it smart: Centralize in ELK or cloud logging with retention plans that fit your needs

Example: structured logging in Python

import logging

logger = logging.getLogger(__name__)

def process_order(order_id, user_id, product_name, quantity):
    logger.info("Processing order", extra={
        "order_id": order_id,
        "user_id": user_id,
        "product_name": product_name,
        "quantity": quantity,
        "event_type": "order_processing"  # adding context for analysis
    })
    try:
        # ... order processing logic ...
        logger.info("Order processed successfully", extra={
            "order_id": order_id,
            "status": "success",
            "event_type": "order_completion"
        })
    except Exception as e:
        logger.error("Error processing order", extra={
            "order_id": order_id,
            "error_message": str(e),
            "event_type": "order_error"
        }, exc_info=True)  # include the stack trace for errors
        raise

Example usage:

process_order("ORD-12345", "user-42", "Awesome Widget", 2)

 

Example log output (JSON):

{"asctime": "2025-02-13 18:00:00", "levelname": "INFO", "name": "__main__", "message": "Processing order", "order_id": "ORD-12345", "user_id": "user-42", "product_name": "Awesome Widget", "quantity": 2, "event_type": "order_processing"}
{"asctime": "2025-02-13 18:00:01", "levelname": "INFO", "name": "__main__", "message": "Order processed successfully", "order_id": "ORD-12345", "status": "success", "event_type": "order_completion"}
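JSON lines like these are typically produced by a formatter such as python-json-logger. As a rough stdlib-only sketch (the class name and field handling are our own, not from the example above), a formatter that merges `extra=` fields into the JSON payload might look like:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter; real projects often use python-json-logger."""
    # Capture the standard LogRecord attributes for this Python version,
    # so anything passed via `extra=` can be told apart from them.
    RESERVED = set(logging.LogRecord(None, 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}

    def format(self, record):
        record.asctime = self.formatTime(record)
        payload = {
            "asctime": record.asctime,
            "levelname": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        }
        # Merge custom fields supplied through `extra=`
        payload.update({k: v for k, v in record.__dict__.items()
                        if k not in self.RESERVED})
        return json.dumps(payload)
```

Attach it to a handler with `handler.setFormatter(JsonFormatter())` and every log line becomes machine-parseable.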

 

When you need structured logging in other programming languages, here are some solid options:

  • Winston for Node.js formats logs and routes them through multiple destinations
  • Serilog for .NET provides strongly-typed logging with good performance
  • Logrus in Go brings structured logging up a level with rich fields and hooks

Each of these libraries provides a consistent way to capture what your code does as it runs. 

Distributed tracing: connect your stack end-to-end

When requests flow through multiple services, you need clear insights into what happens where. OpenTelemetry makes this simple by linking your components together so you can find and fix issues quickly.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    with tracer.start_span("validate_order") as span:
        # Add business context
        span.set_attribute("order_id", order_id)

        if not is_valid(order_id):  # is_valid() is your application's own check
            span.set_status(Status(StatusCode.ERROR))
            return False

    return True

 

Traces show you exactly what's happening when things go wrong across your system, helping you find and fix issues faster.

Choose the right monitoring tools

Your choice of monitoring tools directly determines how well you’re able to detect and resolve issues. Here's a clear and practical guide to picking tools that work for your stack.

1. Tool selection checklist

  • What's your technical comfort level? Command-line tools vs simpler interface
  • Where's your app running? Cloud or hybrid setup
  • What can you spend? Open source to enterprise
  • What's needed? Basic tracking or deep insights

2. APM tools

Quick setup, full visibility

  • ✓ Zero setup headaches
  • ✓ Ready-to-go monitoring tools
  • ✓ Start tracking in minutes
  • ✓ Built for production loads
  • ✗ Pay more as you grow
  • ✗ Hard to customize for specific needs
  • ✗ You're tied down to one vendor's way of doing things

3. DIY monitoring 

For the control enthusiasts

  • ✓ Build it exactly how you want
  • ✓ Backed by the open source community
  • ✓ Minimize costs by keeping infrastructure spend low
  • ✓ Swap and replace tools without friction
  • ✗ Requires deep technical know-how
  • ✗ You own the monitoring stack
  • ✗ Scale it yourself

4. Cloud-native monitoring

Ideal when you're all-in on one cloud

  • ✓ Works with your platform
  • ✓ Quick to set up
  • ✓ Often included in platform costs
  • ✓ One dashboard for everything
  • ✗ Tied to your platform
  • ✗ Limited tweaking options
  • ✗ Features vary by platform

Quick comparison

Feature | SaaS APM | Open Source | Platform-Native
Setup | Quick (Datadog, Sentry) | Complex (Prometheus + Grafana) | Built-in (AWS CloudWatch)
Control | Limited | Full | Platform-based
Scale | Automatic | Manual | Platform-tied
Cost | Usage-based | Infrastructure + maintenance | Platform-included
Maintenance | Managed | Self-managed | Platform-managed
Integration | Pre-built connectors | Custom implementation | Native tools
Best for | Fast deployment | Complete flexibility | Platform alignment

Tip: Mix tools based on what you monitor. Use SaaS APM for core metrics plus specialized tools for specific needs.

Pick your monitoring stack based on what matters most:

  • DIY route: When you need granular control and custom compliance features
  • SaaS APM tools: When you want monitoring that just works and scales with you
  • Cloud-native tools: When your apps live in one cloud and you want everything integrated

Actionable monitoring: turn data into decisions

Now that we've covered monitoring tools, let's put that data to work. Here's how to transform metrics into automated actions:

1. Smart alerts that power action

Configure smart alerts that turn data into action:

  • Impact-aware: Focus on user and business-critical metrics
  • Precision targeting: Route alerts to service owners instantly
  • Rich context: Include actionable troubleshooting data
  • Dynamic thresholds: Use dynamic baselines to reduce noise
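A dynamic baseline like the one described above is often just a rolling mean plus a few standard deviations. Here is an illustrative sketch; the function names and the k=3 factor are assumptions, not a prescription:

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Alert threshold: mean + k standard deviations of recent samples."""
    return statistics.fmean(history) + k * statistics.pstdev(history)

def is_anomalous(value, history, k=3.0):
    """True when a new reading exceeds the dynamic baseline."""
    return value > dynamic_threshold(history, k)
```

Because the threshold tracks recent behavior, a service that normally runs hot won't page anyone, while a sudden departure from its own baseline will.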

Alert and runbook templates:

Response time > 500ms for 5 min:
  • check_cache_hit_ratio
  • verify_db_connections
  • scale_service_pods

Error rate > 1% for 1 min:
  • inspect_error_logs
  • verify_dependencies
  • rollback_if_needed

CPU load > 80% for 3 min:
  • analyze_resource_usage
  • optimize_queries
  • add_capacity

Health check failures:
  • verify_endpoints
  • check_certificates
  • restart_if_unresponsive
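One way to keep runbooks like these actionable is to encode them as data next to your alerting code. This mapping is a hypothetical sketch, not any real tool's API:

```python
# Runbooks as data: each alert name maps to an ordered list of actions.
RUNBOOKS = {
    "high_response_time": ["check_cache_hit_ratio", "verify_db_connections", "scale_service_pods"],
    "high_error_rate": ["inspect_error_logs", "verify_dependencies", "rollback_if_needed"],
    "high_cpu_load": ["analyze_resource_usage", "optimize_queries", "add_capacity"],
}

def actions_for(alert_name):
    """Return the runbook steps for an alert; unknown alerts page a human."""
    return RUNBOOKS.get(alert_name, ["page_on_call"])
```

Checking runbooks into version control alongside alert definitions keeps the two from drifting apart.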

2. Automation

Let's implement automated monitoring workflows:

  • Alert routing:
    • Smart routing sends critical issues straight to the right teams through PagerDuty/Slack
    • Auto-escalate alerts based on severity and response times
    • Route alert types through custom paths
  • Issue tracking:
    • Auto-create tickets with full context and stacktrace in 2s flat
    • Monitor metrics and behavioral telemetry to surface emerging patterns (sub-100ms response time)
  • Resource scaling:
    • Auto scale based on real performance data
  • Release control:
    • Roll back deployments and canary releases if a metric starts heading in the wrong direction.

# Alert configuration example
alerts:
  - name: "High API Response Time"
    metric: "http_request_duration_seconds_p95"
    threshold: 500ms
    duration: "5m"
    severity: "warning"
    route_to: "dev-team-channel"
    notification_type: "slack"

  - name: "Critical Service Down"
    metric: "service_health_check_failed"
    service: "payment-service" 
    severity: "critical"
    route_to: "on-call-pager"
    notification_type: "pagerduty"
    actions:
      - "auto_restart_service"
      - "create_jira_ticket"

 

3. Dashboards that cut through noise

With automation and smart alerts in place, let's create focused dashboards that surface critical insights. Here's what we're working with:

  • Core metrics: Track real-time system health and performance metrics
  • Performance data: Pinpoint and resolve performance issues before they affect your users
  • Debug toolkit: Drill into metrics when things go wrong
  • Role-based views: Create tailored dashboards specific to what each team needs to see

Effective monitoring lets you detect and resolve technical issues before they impact your users.

Cost control

  • Problem: Monitoring gets more expensive as you grow
  • Solution: Track costs and optimize spending by monitoring data storage and usage metrics.

Track key metrics:

Monthly Cost = (Data Points × Storage Time) + (Log Volume × Rate) + (Query Usage × Price)
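The formula above translates directly into code. This helper is purely illustrative, and the units (per-point, per-GB, per-query prices) depend entirely on your vendor's pricing:

```python
def monthly_cost(data_points, storage_time, log_volume, log_rate, query_usage, query_price):
    """Literal translation of: (Data Points × Storage Time)
    + (Log Volume × Rate) + (Query Usage × Price)."""
    return (data_points * storage_time) + (log_volume * log_rate) + (query_usage * query_price)
```

Plugging your own billing data into a function like this makes retention and sampling trade-offs concrete instead of guesswork.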

  • Budget alerts that bite early
  • Ruthless retention policies
     

Evolve your monitoring

Like the applications they track, monitoring systems need to grow and adapt over time. Static monitoring setups quickly become outdated as your architecture evolves, new services are added, and business priorities shift. Exceptional monitoring isn't a one-time implementation—it's an ongoing practice that continually delivers increasing value.

Set a quarterly cadence to evaluate your entire observability approach, examining which metrics provide actionable insights and which generate noise. This consistent review cycle ensures your monitoring tools detect the issues that matter most to your current architecture and business goals, rather than solving yesterday's problems.

The most mature engineering organizations treat their monitoring configurations with the same care as application code—versioned, tested, and continuously improved. By approaching monitoring as a living system rather than a static setup, you'll build observability that remains relevant and valuable even as your technical setup transforms.

Your monitoring checklist: a practical starting point

Let's build your monitoring strategy based on your application's needs. Start with these foundational steps:

  • Map key user flows: Document your critical business transactions and user paths
  • Monitor core metrics: Track response times, CPU, memory, and errors
  • Add structured logging: Use JSON with request IDs for tracing
  • Set basic monitoring: Configure uptime checks and user journey tests
  • Build focused dashboards: Create views that highlight important metrics
  • Configure alerts: Set up notifications that drive action
  • Review and improve: Schedule regular checks to refine your setup

Observability evolution path

Let's explore how monitoring evolves to support your needs:

Level | Monitoring Capability | Business Impact
1 | Basic metrics & logs | Faster incident response
2 | Structured monitoring | Proactive issue prevention
3 | Auto-remediation | 99.99% uptime
4 | Predictive analytics | Zero-impact changes



Optimize your monitoring pipeline

Reliable monitoring systems need a few key elements:

  • Use smart sampling to keep data clean without losing important signals
  • Balance traffic loads to capture all critical metrics
  • Index data for quick issue detection
  • Logically structure data to simplify troubleshooting
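Smart sampling usually means keeping every warning and error while sampling only routine logs. A minimal sketch follows; the field names are assumptions:

```python
import random

ALWAYS_KEEP = {"WARN", "ERROR", "CRITICAL"}

def should_keep(record, sample_rate=0.1):
    """Always keep warnings and errors; sample routine logs to cut volume."""
    if record["level"] in ALWAYS_KEEP:
        return True
    return random.random() < sample_rate
```

At a 10% sample rate this drops roughly 90% of INFO/DEBUG volume while every important signal still gets through.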

Security

What you should think about for monitoring data:

  • Filter sensitive data at collection
  • Implement retention policies that match compliance
  • Monitor access patterns throughout the system
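Filtering sensitive data at collection can be as simple as masking known fields before a log event leaves the process. The key list here is illustrative; yours should come from your compliance requirements:

```python
# Illustrative deny-list; extend to match your compliance obligations.
SENSITIVE_KEYS = {"password", "token", "credit_card", "ssn"}

def redact(event):
    """Return a copy of a log event with sensitive fields masked."""
    return {k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else v)
            for k, v in event.items()}
```

Doing this at the source means secrets never reach your log pipeline at all, which is far safer than scrubbing them downstream.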

Key metrics

Here's what to track to make your systems scale well:

  • Mean Time to Recovery (MTTR): How quickly you bounce back from incidents
  • Mean Time Between Failures (MTBF): Time between system issues - longer is better
  • Service Level Objectives (SLO): Your reliability targets and commitments
  • Error Budget & Burn Rate: Acceptable failure threshold and consumption rate
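MTTR and MTBF fall out of your incident history directly. This sketch assumes incidents are recorded as (start, end) pairs in minutes; the function names are our own:

```python
def mttr_minutes(incidents):
    """Mean time to recovery: average of (resolved - started) per incident."""
    durations = [end - start for start, end in incidents]
    return sum(durations) / len(durations)

def mtbf_minutes(incidents):
    """Mean time between failures: average gap between consecutive incident starts."""
    starts = sorted(start for start, _ in incidents)
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)
```

Tracking these two numbers over quarters tells you whether your monitoring investment is actually shortening outages and spacing them further apart.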

Every small improvement takes you closer to a monitoring system that prevents issues rather than just reacting to them. Build it step by step and watch those late-night alerts become a thing of the past.

© 2025 Platform.sh. All rights reserved.