
Senior HPC Performance Engineer at NVIDIA
Santa Clara, United States


Job Description

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver libraries like NCCL, NVSHMEM, and UCX for Deep Learning and HPC. We are looking for a motivated performance engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with an HPC and performance background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.

  • Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack

  • Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available

  • Triage and root-cause performance issues reported by our customers

  • Collect large volumes of performance data; build tools and infrastructure to visualize and analyze the information

  • Collaborate with a very dynamic team across multiple time zones

What we need to see:

  • M.S. (or equivalent experience) or Ph.D. in Computer Science or a related field, with relevant performance engineering and HPC experience

  • 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Experience conducting performance benchmarking and triage on large-scale HPC clusters

  • Good understanding of computer system architecture, HW-SW interactions and operating systems principles (i.e., systems software fundamentals)

  • Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required

  • Ability to debug performance issues across the entire HW/SW stack. Proficient in a scripting language, preferably Python

  • Familiarity with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)

  • Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and time zones

Ways to stand out from the crowd:

  • Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, and congestion control

  • Experience debugging network issues in large-scale deployments

  • Familiarity with CUDA programming and/or GPUs

  • Experience with Deep Learning frameworks such as PyTorch and TensorFlow

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

