Site Reliability Cloud Engineer
Company
IBM
Location
US Austin
Type
Full Time
Job Description
Working in IBM Cloud gives you the platform to learn, develop, and use your skills every day by working on the latest cloud-related technology, products, and services. You’ll be working in an environment where we understand that we thrive best when we play to our strengths. That’s why developing our people is key to our success; the door is always open for those ready to advance their career.
Curiosity and courageous thinking are both vital when working in IBM Cloud, as we continue our dedication to ensuring that we are at the forefront of cloud technology. Our renowned legacy means we are leading the way in everything from analytics and security through to unmatched hardware and software designs. We provide our clients with full end-to-end transformation as we build IBM’s next-generation cloud platform, which is focused on delivering performance and predictability at a global scale.
IBM’s product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
Your Role and Responsibilities
We are looking for a dynamic Site Reliability Engineer to join our Cloud IaaS Operations Team in Austin, TX. The ideal candidate is responsive to market needs and delivers value to our clients in a fast-changing cloud landscape. An SRE spends 50% of their time on toil and 50% on engineering projects. The role requires full-stack systems thinking and coding skills, with a data-driven focus on app and service availability that draws on AI, including machine learning. The SRE team is dedicated to ensuring that IBM Cloud is at the forefront of cloud technology, from data center design, storage and network architecture, and compute clusters to flexible infrastructure services. We operate IBM’s cloud platform and are building IBM’s next-generation cloud platform and VMware solutions to deliver performance and predictability for our customers’ most demanding workloads at global scale, with leadership in efficiency, resiliency, and security. It is an exciting time, and as a team we are driven by this incredible opportunity to thrill our clients.
Primary Roles & Responsibilities:
In this Site Reliability Engineer role, you will work closely with several data centers, the entire Cloud organization, and IBM vendors to support, maintain, and operationally improve the IBM Cloud infrastructure. You will focus on the following key responsibilities:
- Monitor the health of production and test systems
- Respond promptly to production issues and alerts
- Execute changes in the production environment through automation and AI
- Partner with other SRE teams and program managers to deliver mission-critical services to the market
- Support development of new and existing capabilities for our compute, storage, and network infrastructure services
- Implement and automate infrastructure solutions that support IBM Cloud products and infrastructure
- Support the compliance and security integrity of the environment
- Automate health monitoring of the production and test systems
- Automate return to service procedures for Cloud Service delivery
- Support the compliance and security integrity of the environment through your work
- Partner with other teams, functional managers, and program managers to deliver mission-critical services to the market
- Create Power BI dashboards on historical and predicted data for client use cases
- Design and implement the extraction of key entities from millions of unstructured files using Python NLP techniques and Apache Spark
- Expertise in data interpretation and visualization
- Define problems and opportunities in a complex business area
- Develop advanced analytics products
- Create and develop end-to-end, data-driven solutions to support and monitor the health of production and test systems
- Extract data from multiple varied sources and integrate it for analytics and application development
- Experience with machine learning engineering to develop self-running AI software to automate predictive models
- Experience with designing machine learning systems and algorithms to generate accurate predictions.
- Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
- Working knowledge of container technologies: Kubernetes (preferred), Docker, etc.
- Hands-on knowledge of log aggregation software such as Splunk or ELK
- Must have the ability to perform debugging and problem analysis by examining logs and running Unix commands
Work with Engineering to:
- Provide an initial assessment of, and possible workarounds for, production issues
- Troubleshoot and resolve production issues
Work with Support and Development teams to:
- Identify and resolve issues
- Discuss and plan integration tasks
- Provide technical escalation support for other Infrastructure Operations teams
Required Technical and Professional Expertise
- 8+ years of overall industry experience, with at least 6 years of experience in machine learning
- Technical expertise in solutioning with Hadoop, Hive, Spark/PySpark, SQL, and Oozie, along with data modelling in Hive
- Proven ability in solutioning that covers data ingestion, data cleansing, ETL, data mart creation, and exposing data for consumers
- Scope and deliver solutions with the ability to design solutions independently based on high-level architecture.
- Data expertise to manipulate and integrate big data, different data types, and other structured databases, plus Python skills for data handling
Preferred Technical and Professional Expertise
- Up-to-date technical knowledge, maintained by attending educational workshops and reviewing publications
- 6+ years of experience in virtualization environments such as AWS, SoftLayer, Xen, or VMware
- Working knowledge & experience with Databases/Storage/Networking in the Cloud
- Experience with VMware NSX, vRealize Operations Manager, vRealize Network Insight, and vSAN
- Experience in maintaining cloud-based solutions with VMware vCloud Director
- Experience with replication/failover using Zerto Platform, VMware vCloud Availability, or Veeam Cloud Connect
Date Posted
03/06/2024