About the Role
We are seeking a Site Reliability Engineer (SRE) to improve system reliability, scalability, and operational efficiency for enterprise and cloud-based applications.
Job Description
- Maintain highly available production systems
- Monitor infrastructure and application health
- Automate operational and support activities
- Handle incident response and root-cause analysis
- Improve observability and system performance
- Work with cloud-native technologies
Responsibilities
- Develop monitoring and alerting solutions
- Optimize infrastructure performance and uptime
- Support Kubernetes and cloud platforms
- Manage log aggregation and observability tools
- Automate infrastructure provisioning
Required Skills
- Experience with Kubernetes and cloud platforms
- Knowledge of monitoring and logging tools such as Prometheus, Grafana, ELK, and Splunk
- Familiarity with Terraform/Ansible
- Strong Linux administration skills
- Experience in incident management and troubleshooting
Experience
- 4 to 10 years