Site Reliability Engineer

Site reliability engineering (SRE) sits at the crossroads of IT operations, support, and software engineering. The role blends these skills to tighten the relationship between IT and developers, leading to shorter feedback loops, better collaboration, and more reliable software.

Department: Operations
Project Location(s): San Jose, Costa Rica

Objectives of this role

  • Run the production environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to manage platform infrastructure and applications
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
  • Provide primary operational support and engineering for multiple large-scale distributed software applications

Required skills and qualifications

  • Bachelor’s degree (or equivalent) in computer science or related discipline
  • Ability to program (structured and object-oriented) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
  • Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks such as Apache Mesos, Kubernetes, and YARN
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement

Responsibilities

  • Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and ongoing improvements
  • Balance feature development speed against reliability using well-defined service-level objectives (SLOs)