AI & HPC Network Engineer
Design and deploy high-throughput, low-latency networks that support AI training and inference at scale. Troubleshoot Linux networking stacks across thousands of nodes and tune kernel parameters for performance. Automate network provisioning, monitoring, and diagnostics with Python, Ansible, Terraform, and related tooling. Implement and manage L2/L3 topologies using EVPN-VXLAN, BGP, and OSPF, and handle InfiniBand, RoCEv2, SR-IOV, and SmartNIC deployments. Analyze performance metrics to identify packet loss, congestion, and jitter, and integrate telemetry with Prometheus, Grafana, and sFlow/NetFlow. Collaborate with security and platform teams on segmentation and policy enforcement, and participate in design reviews, capacity planning, and incident response. Travel may be required.
