Mid-level Site Reliability Engineer specializing in cloud infrastructure and Kubernetes
Austin, TX | Site Reliability Engineer | 4 years experience | Mid-Level | Cloud Computing | Technology | DevOps
Screened | Identity Verified
About
Backend- and infrastructure-focused engineer who owned production systems for distributed ML experimentation: hyperparameter tuning across a cluster with GPU scaling, custom scheduling, and checkpoint-based fault tolerance. Also built and operated a low-latency log validation service using queued asynchronous workflows with idempotency, retries with backoff, and strong observability, and built resilient Selenium-based browser automations for complex multi-step web flows.
Experience
Site Reliability Engineer, EthosTech
Cloud Infrastructure Engineer (Research), University of Georgia
Software Engineer, Accolite
Education
University of Georgia, Master's, Computer Science (2024)
Key Strengths
Built and owned an end-to-end distributed hyperparameter tuning platform with orchestration, scaling, and checkpointing
Designed custom cluster scheduling to reduce resource imbalance and improve throughput
Implemented fault-tolerant checkpoint/resume for long-running distributed training jobs
Built low-latency backend processing using queues/workers to absorb traffic spikes
Applied production reliability practices: retries with exponential backoff, idempotency/deduplication, stateless services, and job reprocessing on worker failure