Site Reliability Engineer - SRE

Open Practice Solutions, Ltd.

Hudson, United States of America

6 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$85K

Job location

Hudson, United States of America

Tech stack

Java

Amazon Web Services (AWS)

Software as a Service

Databases

Continuous Integration

Data Stores

Data Warehousing

Linux

DevOps

Fuzz Testing

Memcached

MySQL

Nagios

Performance Tuning

Redis

Reliability Engineering

Web Applications

Data Logging

Java Application Server

Grafana

GitLab

Graphite

CloudWatch

Terraform

Redshift

Job description

We're looking for a Site Reliability Engineer to help operate and scale a multi-tenant, web-based application running on AWS. This is a hands-on role for someone who's comfortable jumping into an already-established architecture, making incremental improvements, and solving real production problems.

You'll work closely with engineering and product teams to keep our platform reliable, performant, and scalable as customer usage grows. This is not a "greenfield rewrite" role - we need someone scrappy, practical, and effective inside real-world constraints.

What You'll Do

Ensure the reliability, availability, and performance of a multi-tenant production system
Scale and operate AWS-based infrastructure supporting a Java web application
Monitor and troubleshoot issues across application, database, cache, and data warehouse layers
Improve observability through metrics, logging, and alerting
Participate in on-call rotations and lead incident response and root cause analysis
Identify performance bottlenecks and scaling limits in a shared-tenant environment
Automate operational tasks and reduce toil where it matters most
Work within existing frameworks and tooling to make systems safer and more scalable
Partner with developers to improve deployments, capacity planning, and failure handling
Implement automated load and fuzz testing
Define key service level objectives (SLOs)

Technologies You'll Work With

AWS (EC2, ECS, RDS, ElastiCache, Redshift, and related services)
Java-based web applications
MySQL (performance tuning, scaling, reliability)
Amazon ElastiCache (Redis/Memcached)
Amazon Redshift
Monitoring and alerting tools (Graphite, Grafana, CloudWatch)

Requirements

3+ years of experience in SRE, DevOps, or production operations roles
Strong understanding of AWS infrastructure and cloud-native scaling patterns
Experience supporting Java applications in production
Solid knowledge of MySQL performance, replication, and scaling strategies
Experience operating cache layers and data stores at scale
Understanding of multi-tenant architectures, including isolation, noisy-neighbor issues, and capacity planning
Strong Linux fundamentals and troubleshooting skills
Ability to stay calm, think clearly, and prioritize during incidents
A "get-things-done" mindset - pragmatic, resourceful, and comfortable with imperfect systems

Nice to Have

Experience scaling multi-tenant SaaS platforms
Familiarity with Redshift performance tuning and data workflows
Infrastructure-as-code experience (Terraform)
CI/CD and GitLab pipeline experience
Prior ownership of on-call rotations and incident processes
Experience improving reliability without large architectural rewrites

What We Value

Engineers who work within reality, not just ideal architectures
Incremental improvements that reduce risk and improve uptime
Clear communication during incidents
Ownership, accountability, and practical problem-solving

Benefits & conditions

Pay: From $85,000.00 per year

Application Question(s):

Describe a production outage you personally worked on, including how it was detected, how you mitigated it, and what permanent changes you implemented to prevent recurrence.
Explain how you would design a highly available architecture in Amazon Web Services for a Java-based, multi-tenant SaaS application using MySQL and Redis.
How would you prevent a single tenant from exhausting shared database or cache resources in a multi-tenant SaaS platform?
Tell us about a time you inherited a fragile production system and what you prioritized first to improve reliability.
Your API latency increases from 80ms to 900ms across all tenants while CPU utilization remains normal. What are the first steps you take to investigate?
We are only considering candidates currently based in Northeast Ohio for this role (relocation is not being considered at this time). Are you currently located in Northeast Ohio?
