Key Responsibilities
Architecture and Design
Design and implement scalable, resilient, and secure platform solutions
Develop and maintain infrastructure-as-code using tools such as Terraform, CloudFormation, and Ansible
Create and optimize CI/CD pipelines for efficient software delivery
Architect cloud-native solutions leveraging containerization and microservices
Implement disaster recovery and business continuity strategies
Infrastructure Management
Manage and optimize our public cloud infrastructure (AWS, Azure, or GCP)
Manage and optimize private cloud infrastructure on partner premises
Implement best practices for cloud security, compliance, and cost optimization
Design and implement multi-region and multi-cloud strategies
Design and maintain containerized application environments using Docker
Architect, deploy, and manage Kubernetes clusters for container orchestration
Automation and DevOps
Develop automation scripts and tools to streamline operations and reduce manual tasks
Integrate monitoring, alerting, and logging systems
Ensure standardized QA and production environments by implementing appropriate branching strategies
Configure and manage load balancers (e.g., NGINX, HAProxy, cloud-native solutions)
Implement and manage service mesh technologies (e.g., Istio, Linkerd) for microservices architectures
Performance Optimization
Analyze and optimize system performance, identifying and resolving bottlenecks
Conduct capacity planning and implement auto-scaling solutions
Optimize container resource allocation and performance
Team Leadership and Collaboration
Mentor junior engineers and provide technical guidance to the team
Collaborate with cross-functional teams to align platform capabilities with business needs
Contribute to technical decision-making and architectural reviews
Documentation and Knowledge Sharing
Maintain comprehensive technical documentation for platform components and processes
Contribute to internal knowledge bases and conduct knowledge-sharing sessions
L2 Support and Escalation Management
Provide expert-level troubleshooting and resolution for critical platform and infrastructure problems
Analyze recurring issues and implement long-term solutions to prevent future occurrences
Collaborate with the operations team to improve support processes and knowledge transfer
Conduct post-incident reviews and implement lessons learned to enhance system reliability
Required Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field
5+ years of experience in platform engineering, DevOps, or similar roles
Strong proficiency in at least one cloud platform (AWS, Azure, or GCP)
Expert-level knowledge of containerization technologies (Docker, Kubernetes)
Extensive experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation, Pulumi)
Proficiency in scripting languages (e.g., Bash)
Strong understanding of networking concepts, load balancing, and CDNs
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack)
Excellent problem-solving skills and ability to troubleshoot complex systems