Developer-turned-SRE with 5+ years of experience building and operating production-grade distributed systems across AWS, Docker, Kubernetes, and Linux infrastructure. Currently responsible for platform reliability at Solidigm (an SK hynix company), combining strong backend engineering skills with deep operational expertise in observability, CI/CD automation, and incident management.
I started my career as a software engineer at a high-growth startup, building backend services and RESTful APIs with Node.js and PostgreSQL. Shipping features under tight deadlines there gave me a strong sense of ownership and drew me naturally toward reliability engineering.
Today, I pair backend development in Node.js, Java, and Python with hands-on operations work to keep systems available, performant, and able to scale. I specialize in CI/CD automation, incident management, capacity planning, and production readiness for event-driven microservices.
I hold a Master of Science (Summa Cum Laude) from the Virginia Institute of Science and Technology and a B.Tech in Computer Science from JNTUH, Hyderabad.
Infrastructure at Scale
Managing multi-tenant enterprise platforms on AWS with Docker, Kubernetes, and Terraform across production environments.
Full-Stack Observability
Building monitoring strategies with Splunk, Dynatrace, Prometheus, and Grafana covering 50+ API endpoints with SLI/SLO tracking.
AWS Certified
AWS Certified DevOps Engineer - Professional, with deep expertise across EC2, ECS, Lambda, RDS, S3, SQS, SNS, and CloudWatch.
02. Where I've Worked
Site Reliability Engineer @ Solidigm
June 2024 - Present · San Jose, California
Own end-to-end production reliability for a multi-tenant enterprise platform, managing infrastructure provisioning, CI/CD pipelines, observability, and incident response.
Architected a comprehensive monitoring strategy using Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana with 20+ dashboards covering API latency (p50/p95/p99) and error budgets.
Designed zero-downtime deployment pipelines using GitHub Actions, reducing failed deployments by 70% and cutting deployment time from 25 to 8 minutes.
Automated operational toil using AWS Lambda, EventBridge, and Python/Bash scripts, eliminating 15+ hours/week of manual effort.
Led incident management for production outages, reducing recurring incidents by 55% through preventive automation and blameless post-mortems.
Architected a comprehensive monitoring strategy with 20+ dashboards covering infrastructure health, API latency (p50/p95/p99), error budgets, and resource utilization. Integrated Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana for full-stack visibility across a multi-tenant platform.
Splunk · Dynatrace · Prometheus · Grafana · CloudWatch · Python
Zero-Downtime CI/CD Pipeline
Solidigm
Designed deployment pipelines using GitHub Actions with parallel jobs, health checks, atomic release switching, and automated rollback. Using blue-green and canary strategies, reduced failed deployments by 70% and cut deployment time from 25 minutes to 8 minutes.
GitHub Actions · Docker · AWS ECS · Terraform · Nginx
Event-Driven Order Pipeline
ValueLabs
Engineered a Kafka-based order processing pipeline handling 50K+ daily orders with 8 topic partitions and 3 consumer groups. Implemented dead-letter queues, exactly-once delivery semantics, and a Redis caching layer that cut p99 latency from 450 ms to 85 ms.
Apache Kafka · Redis · Node.js · PostgreSQL · AWS SQS
Serverless Notification Service
ValueLabs
Built a high-throughput notification microservice processing 100K+ daily events via AWS SQS FIFO queues with exponential backoff retry logic. Implemented fan-out dispatch using SNS with message filtering across email (SES), SMS, and push channels.
AWS Lambda · SQS · SNS · SES · EventBridge · Python
Infrastructure as Code Platform
Solidigm
Engineered repeatable provisioning of EC2, RDS, S3, Lambda, EventBridge, and IAM resources using Terraform and AWS CDK. Maintained environment parity across dev/staging/prod with least-privilege IAM access controls and automated compliance checks.
Terraform · AWS CDK · CloudFormation · IAM · Python · Bash
Production Toil Automation
Solidigm
Eliminated 15+ hours/week of operational toil by using AWS Lambda and EventBridge to automate nightly maintenance: usage quota recalculation, stale session cleanup, metrics aggregation, and certificate rotation across production environments.
AWS Lambda · EventBridge · Python · Bash · CloudWatch
04. Skills & Technologies
Languages
Python
Node.js
JavaScript
Java
Go
Bash
Backend & APIs
Express.js
Spring Boot
RESTful APIs
GraphQL (Apollo Server)
Prisma ORM
Sequelize
Cloud & IaC
AWS
GCP
Linux (Ubuntu, RHEL)
Terraform
AWS CDK
CloudFormation
Ansible
Containers
Docker
Kubernetes (EKS)
Helm
Nginx
ALB
Auto-Scaling
Observability
Splunk
Dynatrace
Prometheus
Grafana
Datadog
CloudWatch
ELK Stack
OpenTelemetry
CI/CD & Release
GitHub Actions
Jenkins
GitOps
CodeQL
SonarQube
Blue-Green
Canary
Rollback Strategies
Data & Messaging
PostgreSQL
MySQL
MongoDB
Redis
Apache Kafka
SQS/SNS
EventBridge
DLQs
SRE Practices
SLI/SLO/SLA
Incident Mgmt
RCA
Runbooks
On-Call
Capacity Planning
Post-Mortems
Error Budgets
PagerDuty
ServiceNow
JIRA
05. What's Next?
Get In Touch
I'm always open to discussing new opportunities, interesting projects, or just connecting with fellow engineers. Whether you have a question or just want to say hi, feel free to reach out.