AI Infrastructure: Build and lead the cross-functional team that designs, scales, and operates Rogue's production MLOps platform, covering data pipelines, model versioning, automated deployments, and real-time monitoring across on-prem and cloud GPU clusters. Own reliability, performance, and cost management for all AI compute and storage, including capacity planning, incident response, and continuous optimization to meet SLA/SLO targets.
Site Reliability: Direct the SRE organization that safeguards roguefitness.com and all internal apps, defining SLIs/SLOs, automating CI/CD pipelines, and ensuring release velocity without sacrificing stability. Drive proactive reliability engineering: establish unified observability, conduct capacity and chaos testing, and lead rapid incident response to keep MTTR low and uptime above targets. Own continuous improvement of performance, scalability, and cost efficiency, partnering with product and infrastructure teams to embed reliability best practices fr...
Director, Infrastructure, Reliability Engineer, Manufacturing, Technology, Cloud