Vetted Observability Professionals

Pre-screened and vetted.


Sourabh Jain

Screened

Director of Software Engineering specializing in enterprise Data, ML & AI platforms

Bay Area, CA | 23y exp
RSA Security | Shri G. S. Institute of Technology and Science

Former Walmart Director of Software Engineering who left in March 2025 to build products for clients. Recently delivered an LLM/RAG-based UNSPSC classification solution for an MRO client using a multi-stage retrieval + web search + prompt-engineering workflow, and has led large-scale retail forecasting initiatives and high-severity cloud-migration incidents end-to-end.

JK

Mid-level Software Engineer specializing in backend, cloud, and AI systems

Seattle, WA | 4y exp
Amazon | Saint Louis University

Engineer with hands-on experience across backend, full-stack, cloud, and AI/ML systems, with particular depth in Python, FastAPI, AWS Bedrock, SageMaker, and RAG-based architectures. Stands out for treating AI and agents as accelerators within disciplined production engineering, emphasizing guardrails, observability, latency/cost monitoring, and scalable system design.


Suraj Botcha

Screened

Intern AI/ML Engineer specializing in LLM systems and industrial AI

Remote | 1y exp
ControlRooms.AI | Carnegie Mellon University

Full-stack AI engineer who has built both document-intelligence products and agentic investigation systems end to end. At ControlRooms.AI, they helped ship a production-facing root cause investigation workflow for industrial operations using Neo4j, FastMCP, RAG, OCR/VLM inputs, and multiple LLMs, contributing to roughly a 10x reduction in manual investigation time. They stand out for designing explainable, traceable AI systems that surface evidence, uncertainty, and missing context rather than forcing overconfident answers.


Jay Yepuri

Screened

Mid-level Software Engineer specializing in AWS backend and cloud infrastructure

Seattle, WA | 4y exp
Amazon | Indiana University Bloomington

Full-stack engineer with AWS Skill Builder experience building internal content-management and search workflows in TypeScript across React and Node.js. They drove a shift from keyword to semantic search using OpenSearch and Bedrock Titan Embeddings, delivering a 5x reduction in discovery time while also improving production reliability and observability for large-scale content workflows.


Sara Fang

Screened

Mid-level Software Engineer specializing in cloud data platforms and distributed systems

Remote | 6y exp
Terra Byte X | University of Delaware

Backend/data engineer with production experience building FastAPI services with strong reliability patterns (circuit breaker, rate limiting, caching, graceful degradation) and JWT/OAuth2 auth. Has delivered AWS EKS deployments via Terraform with Secrets Manager/IRSA and HPA autoscaling, and built Glue/Spark ETL pipelines on S3 Parquet with schema-evolution and idempotent reruns; also demonstrated measurable SQL tuning impact (20–30s to <10s).

BS

Mid-level Full-Stack Developer specializing in cloud-native backend services and real-time data platforms

Remote, USA | 4y exp
Netflix | University of Dayton

Backend/data engineering candidate with Netflix experience designing and migrating analytics platforms from batch to real-time streaming (Kafka/Flink) across AWS and GCP. Delivered measurable improvements (40% lower data delay, 99.9% accuracy) using phased rollouts, automated data validation (Great Expectations), and strong observability (Prometheus/Grafana), and proactively hardened pipelines with idempotency to prevent duplicate Kafka processing.

BK

Mid-level Full-Stack Software Engineer specializing in cloud microservices and AI integration

Jersey City, NJ | 3y exp
Uber | Pace University

Backend/distributed-systems engineer with Uber experience building real-time telemetry and safety signal pipelines. Strong in Kafka-based event-driven architectures, low-latency processing under peak load, and production reliability via monitoring, retries, and fallback logic; has Docker/Kubernetes and CI/CD deployment experience.

MC

Intern Firmware Validation & Systems Test Engineer specializing in embedded and full-stack tooling

Palo Alto, CA | 1y exp
Tesla | Oregon State University

Safety-critical firmware validation engineer with Tesla autonomous vehicle experience who built Python-based HIL/SIL automation and dashboards, cutting regression time by 30% while maintaining an auditable risk-tradeoff process with safety and engineering teams. Also deployed an inventory management system across 8+ R&D teams in 3 countries at FUJIFILM, tracing a major cross-site sync issue to a timezone root cause and supporting the fix with strong documentation and interim mitigations.


Tianyi Wang

Screened

Entry-Level Backend/Cloud Engineer specializing in distributed systems and AI platforms

Seattle, WA | 1y exp
Amazon | University of Michigan

Full-stack engineer with deep serverless AWS experience who built VidToNote, an AI video analysis platform, end-to-end using Next.js App Router/TypeScript and an event-driven pipeline (API Gateway, Lambda, DynamoDB, S3, Step Functions, SQS). Strong on production reliability and observability (CloudWatch, X-Ray, structured logging), plus data/analytics work in Postgres with measurable query optimizations and durable LLM evaluation workflows. Amazon background; integrated 22 AWS services and completed AWS Solutions Architect Professional certification within a month.

TH

Mid-level Software Engineer specializing in backend systems, IoT, and AI security

Pittsburgh, PA | 3y exp
Naptic | Carnegie Mellon University

Full-stack engineer in the investment tracking/financial reporting space who built an automated reporting dashboard and compliance/reporting pipeline end-to-end using Next.js (App Router, server/client components), REST, and Postgres. Demonstrated measurable performance wins (~30% faster loads) through caching and query optimization, and built durable orchestrated workflows in n8n with retries, idempotency, and reconciliation checks.

PY

Mid-level Software Development Engineer specializing in AWS telemetry and DDoS mitigation

Seattle, WA | 3y exp
Amazon Web Services | Texas A&M University-Commerce

Amazon engineer who built an Amazon Bedrock-powered summarization layer over large-scale network/service telemetry (“top talker” insights) to help security engineers triage anomalies faster. Emphasizes production-grade design patterns for LLM features—non-blocking enrichment, deterministic fallbacks, strict structured outputs, and monitoring to preserve trust in source-of-truth telemetry.


Andrew Liang

Screened

Intern Software Engineer specializing in full-stack and AI/ML systems

2y exp
Amazon | UCLA

Software engineer with experience at Amazon and Agora building end-to-end systems: a knowledge-base AI chatbot (React/TypeScript UI + retrieval/response backend + Docker deployment) and an internal approval governance platform using AWS Step Functions and DynamoDB. Emphasizes fast iteration without sacrificing trust via feature-flag rollouts, citation-required answers, abstention on low-confidence retrieval, regression query sets, and strong observability (request IDs, structured logs, latency/error monitoring).


Elizabeth Xu

Screened

Entry-Level Software Engineer specializing in ML/NLP and security

Evanston, IL | 1y exp
Rakuten | Northwestern University

Early-career engineer (internship background) who built a production-style notes product using Next.js App Router with Server Components/Server Actions and a Postgres-backed analytics model. Demonstrates strong performance and reliability instincts—measured DB latency improvements via indexing and cursor pagination, plus durable orchestration with Temporal using idempotency and deterministic workflows.

Jacqueline Zhang

Mid-level Machine Learning Engineer specializing in LLMs, fairness, and healthcare ML

Illinois, USA | 4y exp
iSchool Statistical ML & AI Lab | University of Illinois Urbana-Champaign

ML/NLP practitioner with a master’s thesis focused on domain-adaptive knowledge distillation for LLMs (LLaMA2/sheared LLaMA), showing improved perplexity and ROUGE-L on biomedical data. Also built real-world data linking and search systems: integrated ClinicalTrials.gov with FAERS using fuzzy matching + embeddings, and delivered an LLM-powered FAQ recommender at Hyperledger using sentence-transformers, FAISS, and fine-tuning to mitigate embedding drift.

Yeshwanth Sai Pala

Mid-level Full-Stack Developer specializing in cloud microservices and AI-driven FinTech

Remote, USA | 4y exp
Stripe | Southern Arkansas University

Stripe engineer who shipped an end-to-end merchant fraud insights dashboard, spanning Spring Boot/Kafka risk-scoring services and a React+TypeScript UI. Focused on low-latency, high-volume transaction processing and production operations on AWS (EKS/CloudWatch), including handling a real traffic-spike latency incident via query optimization, indexing, and rate limiting.

Likhitha Bethi

Mid-level Software Engineer specializing in backend systems, distributed systems, and applied AI

Stony Brook, NY | 4y exp
Stony Brook University | Stony Brook University

Goldman Sachs engineer who owned end-to-end features for an internal onboarding and case management platform, spanning React/TypeScript UI, a GraphQL gateway, and Node + Spring WebFlux microservices. Built and operated a Kafka-based ingestion and search pipeline with DLQs, retries, idempotency, and strong observability, and improved developer experience via backward-compatible GraphQL API design and schema-driven documentation.


Xicheng Liang

Screened

Intern AI/Full-Stack Engineer specializing in backend systems and applied machine learning

Chicago, IL | 1y exp
Becker’s Healthcare | University of Pennsylvania

Built and shipped a production agentic RAG system for healthcare analysts that automated compliance/operations knowledge retrieval across PDFs, reports, and databases. Emphasizes production reliability (monitoring, retries, fallbacks, async queues), strong evaluation/iteration loops, and measurable impact (3–10s responses and ~98% top-k retrieval accuracy).


Steven Schoen

Screened

Staff Android Engineer specializing in mobile platform and design systems

Berkeley, CA | 12y exp
Reddit | University of Central Florida

Built and shipped a production internal framework-adoption agent for design system leadership, using Temporal, Google ADK, and a Slack bot interface. They appear to be an early internal builder of agentic systems at their company, with practical experience in prompt/process design, lightweight orchestration, and reliability tradeoffs for real-world LLM workflows.

SK

Intern Software Engineer specializing in developer productivity and data/AI systems

Los Angeles, California | 1y exp
Intuit | UC Berkeley

Internship experience at Intuit building an LLM-grounded QA system for internal microservice data across 100+ microservices, using a graph database approach (evaluated Neo4j and selected AWS Neptune for production alignment). Also has UC Berkeley research experience (including work with Prof. Dawn Song / Berkeley Eye Research Lab) and cross-functional collaboration with bioinformatics/biology teams to deploy software systems on research servers.


Yue Yang

Screened

Intern Data Scientist specializing in GenAI (LLMs, RAG) and ML model optimization

Sunnyvale, CA | 1y exp
Synopsys | Columbia University

Built and deployed a production LLM-powered risk assistant for KPMG and Freddie Mac that lets analysts query a confidential Neo4j risk graph in natural language (no Cypher), turning multi-day analysis into minutes with traceable, cited answers. Implemented rigorous guardrails, deterministic verification, RBAC/security controls, and a full eval/observability stack, cutting query error rate by ~50% and iterating through weekly UAT with non-technical risk analysts.

JR

Senior Software Engineer specializing in distributed systems and AI workflow orchestration

Austin, TX | 5y exp
Apple | University of Central Missouri

Backend owner at Apple for an AI workflow orchestration service, with hands-on experience stabilizing peak-traffic production systems using OpenTelemetry-style tracing, bounded async concurrency, and database performance tuning. Built and shipped a Python LLM-agent orchestration layer to automate multi-step operational workflows, emphasizing guardrails, auditability, and deterministic fallbacks to keep non-deterministic AI behavior production-safe.

SB

Mid-level Backend & Reliability Engineer specializing in AWS, Kubernetes, and automation

New Mexico, US | 5y exp
Meta | University of North Carolina at Charlotte

Meta engineer focused on reliability/operations tooling who built a unified real-time health dashboard and scalable telemetry pipelines (AWS + Datadog) for thousands of devices. Also shipped an internal LLM-powered knowledge assistant using RAG over wikis/runbooks/logs with strong guardrails and a rigorous eval loop that drove measurable accuracy improvements via automated doc ingestion and embedding updates.


Pankaj Gautam

Screened

Senior Cloud Infrastructure & TechOps Leader specializing in AWS, Kubernetes, and SRE

San Francisco, CA | 27y exp
Amazon | Cal State East Bay

Infrastructure/platform engineer with hands-on experience running production and non-production Amazon EKS clusters, including upgrade processes and reliability monitoring via Prometheus/Grafana. Also administered on-prem VMware vSphere/vCloud Director and handled a significant vSwitch/VLAN outage, and uses Terraform + Terragrunt with S3 remote state and release-based drift detection across dev/stage/prod.
