Hardware Systems Quality and Reliability Engineering, University Graduate, PhD, Platforms Engineering

Google

Posted 1 day ago

Full Time

Sunnyvale, California

In Person

Smart Summary

Responsibilities

The role involves identifying and resolving fleet-wide hardware technical issues to maximize system reliability within data center environments. Responsibilities include conducting root cause analyses, optimizing system health, and leading cross-functional projects to improve hardware performance.

Qualifications

You have a PhD in Electrical Engineering, Computer Engineering, Physics, or a related field, with experience in data center system hardware domains. You are proficient in data curation, mining, analysis, and visualization using tools such as SQL, JMP, Python, R, or Tableau.

Must Have Skills for ATS

  • Electrical Engineering
  • Computer Engineering
  • Physics
  • semiconductors
  • PCIe
  • power electronics
  • CPU
  • xPU architectures
  • networking
  • embedded systems
  • servers
  • SQL
  • JMP
  • Python
  • R
  • Tableau

Job Description

Minimum qualifications:

  • PhD degree in Electrical Engineering, Computer Engineering, Physics, a related field, or equivalent practical experience.
  • Experience in any one domain of hardware engineering through internships, academic research, or publications (e.g., data center system hardware domains such as semiconductors, PCIe, power electronics, CPU/xPU architectures, networking, embedded systems, and servers).
  • Experience in data curation, mining/analysis, visualization, and scripting utilizing tools such as SQL, JMP, Python, R, Tableau, or similar.

Preferred qualifications:

  • Experience in technical project management and effective communication with executive stakeholders.
  • Proficiency in statistical methodologies, predictive modeling, and data visualization techniques.
  • Expertise in quality and reliability engineering roles.
  • Familiarity with fault isolation and other failure analysis methodologies.
  • Track record of providing technical leadership to cross-functional engineering teams through a solution-oriented and pragmatic methodology.

About the job:

The team is responsible for identifying and resolving fleet-wide technical issues, implementing strategic product and methodological enhancements to maximize hardware system reliability, and ensuring efficient deployment and maintenance within data center environments. We conduct analysis of fleet data to address systemic issues and implement preventative measures to ensure long-term stability.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving force behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

The US base salary range for this full-time position is $132,000-$189,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities:

  • Collaborate on data center hardware platforms across a wide range of domains, including semiconductors, test, Peripheral Component Interconnect Express (PCIe), power, CPU, xPU, power electronics, and networking.
  • Provide technical leadership by establishing priorities, conducting comprehensive root cause analyses, and resolving complex technical challenges to ensure fleet quality and a stable customer experience.
  • Optimize system health and repairability by improving Mean Time Between Failures (MTBF), managing swap rates, and developing advanced repair strategies.
  • Partner with System Software and Diagnostics/Test teams to enhance the detection, characterization, and resolution of fleet-scale hardware failures.
  • Lead the initiation and implementation of innovative product, process, and tool enhancement projects within complex cross-functional environments and integrate lessons learned from field performance data into New Product Introduction (NPI).

Google

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone. Check out our career opportunities at goo.gle/3DLEokh
