Senior Site Reliability Engineer
Bidgely
Who We Are
Bidgely (which means "electricity" in Hindi) is an AI-powered SaaS company accelerating a clean energy future by enabling energy companies and consumers to make data-driven energy decisions.
Ranked #7 in Applied AI on Fast Company’s list of Most Innovative Companies in the World, Bidgely is putting customers at the center of the clean energy future.
What We Do
Powered by our unique patented technology, Bidgely's UtilityAI™ Platform transforms multiple dimensions of customer data - such as energy consumption, demographics, and interactions - into deeply accurate and actionable consumer energy insights. We leverage these insights to empower each customer with personalized recommendations tailored to their individual personality and lifestyle, usage attributes, behavioral patterns, purchase propensity, and beyond.
How We Do It
From a distributed energy resources (DER) and grid edge perspective, Bidgely is advancing smart meter innovation with data-driven solutions for solar PVs, electric vehicle (EV) detection, EV behavioral load shifting and managed charging, energy theft detection, short-term load forecasting, grid analytics, and time of use (TOU) rate designs. Bidgely’s UtilityAI™ energy analytics provides deep visibility into generation and consumption for better peak load shaping and grid planning, and delivers targeted recommendations for new value-added products and services. For more information, please visit www.bidgely.com or the Bidgely blog at bidgely.com/blog.
How You Fit In
Job Description
As a Site Reliability Engineer (SRE), you will play a pivotal role in ensuring the reliability and availability of Bidgely's data processing infrastructure, API services, and customer-facing applications. You will work closely with our DevOps, Product Support, and Platform & Infra teams to develop and implement solutions that proactively detect, prevent, and resolve operational issues. Your efforts will directly enhance our customers' experience by ensuring that Bidgely’s services are fast, reliable, and scalable.
Key Responsibilities:
● System Reliability and Uptime:
Ensure high availability and performance of critical systems, APIs, and infrastructure components.
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to maintain system reliability standards.
Develop and maintain error budgets to balance new feature development with reliability.
● Real-Time Monitoring and Alerting:
Implement and maintain comprehensive monitoring and alerting solutions across critical services, such as APIs, data processing pipelines, databases (e.g., Cassandra, Redshift), and cloud infrastructure.
Set up proactive monitoring for API latency, system load, throughput, and error rates to identify issues before they impact customers.
Collaborate with DevOps and Platform & Infra teams to create end-to-end observability for the entire data processing ecosystem.
● Incident Management and Root Cause Analysis:
Act as the first responder to high-severity incidents, taking ownership of incident management and response.
Conduct thorough root cause analysis post-incident, working closely with cross-functional teams to implement long-term resolutions.
Develop incident runbooks and playbooks to streamline incident response and reduce Mean Time to Recovery (MTTR).
● Automation of Toil and Operational Efficiency:
Identify and automate repetitive, manual tasks to minimize operational overhead, particularly within data ingestion, disaggregation, and notification workflows.
Implement self-healing solutions for commonly recurring issues, reducing the need for manual intervention.
Enhance operational efficiency by optimizing resource utilization across infrastructure components like EMR clusters, Redis instances, and SQS queues.
● Capacity Planning and Scalability:
Perform regular capacity planning to ensure our systems can handle future growth and data processing needs, especially during peak usage periods.
Collaborate with the Platform & Infra and DevOps teams to scale infrastructure effectively, ensuring we meet SLAs for data processing and customer response times.
Monitor and optimize infrastructure costs by ensuring efficient resource allocation and cloud utilization.
● Performance Optimization:
Continuously monitor system performance and optimize APIs, databases, and backend services to reduce latency and improve response times.
Address performance bottlenecks in the data processing pipeline to ensure timely aggregation, disaggregation, and notification generation.
Develop strategies to improve the accuracy and quality of data insights provided to customers.
● Documentation and Cross-Team Collaboration:
Document all reliability processes, runbooks, and incident resolution steps to maintain clear, actionable resources for the team.
Work closely with Product Support to ensure that customer-impacting issues are resolved quickly, and with DevOps to streamline the deployment and release processes.
Collaborate on building a culture of reliability and efficiency across the organization.
Key Performance Indicators (KPIs):
● Service Uptime and Availability: Percentage uptime for critical services and systems.
● Mean Time to Recovery (MTTR): Average time to resolve incidents and restore services.
● Incident Frequency: Number of incidents and issues per period, aiming for continuous reduction.
● Error Budget Compliance: Adherence to error budgets without breaching SLOs.
● Automation Coverage: Percentage of manual tasks that have been automated to reduce operational workload.
● Latency and Performance Metrics: API latency (P50, P95, P99) and system throughput for key workflows.
Qualifications:
● B.Tech/Bachelor's degree in Computer Science or a related field (math, physics, engineering).
● 3-7 years of experience in Site Reliability Engineering (SRE), DevOps, or Infrastructure Engineering.
● Strong experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
● Proficiency in automation and scripting (e.g., Python, Bash, Terraform) to manage infrastructure as code and automate repetitive tasks.
● Hands-on experience with cloud platforms (AWS preferred) and knowledge of services like EC2, SQS, EKS, S3, RDS, Redshift, EMR.
● Experience with incident management and root cause analysis methodologies.
● Familiarity with database systems (e.g., Cassandra, Redis, MySQL) and large-scale data processing pipelines.
● Excellent problem-solving skills, with a proactive approach to identifying and resolving issues.
● Proficient in SQL (basic and advanced) to analyze error and log data, identify patterns that reduce the number of recurring issues, and surface top opportunity areas to reduce ticket volume.
● Bonus points for familiarity with reporting tools such as Tableau, Power BI, or Looker.
What You Get
Perks
- Growth Potential with a Startup: Seize the opportunity to grow with an innovative startup
- Collaborative Environment: Work with a passionate team united by the goal of a clean energy future
- Unique Tools: We provide all the tools you need to excel in your role
- Comprehensive Benefits: Group health insurance, internet/telephone reimbursement, professional development allowance, gratuity, and more
- Mentorship Programs: Opportunity to learn and receive mentorship from industry experts
- Flexible Work Arrangements: Benefit from flexible working arrangements and work from anywhere in India
Diversity, Equity, Inclusion and Equal Opportunity
Bidgely is an equal-opportunity employer. We take diversity and equal opportunity seriously and are committed to building a team that represents a variety of backgrounds, perspectives, and skills to build a better future and a better workforce. Our hiring decisions are based on your skills, talent, and passion – not on your background, gender, race, age, or the quirky way you dance at office parties.
Bidgely is an E-Verify employer.