Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs are designed from the ground up, drawing on Google's deep experience and leadership in machine learning.

Cloud TPU runs your machine learning workloads on Google TPU accelerator hardware using TensorFlow. Cloud TPU is designed for maximum performance and flexibility, helping researchers, developers, and businesses build TensorFlow compute clusters that can leverage CPUs, GPUs, and TPUs. High-level TensorFlow APIs help you get models running on the Cloud TPU hardware.

Setup

To set up the Google integration and discover the Google service, go to Google Integration Discovery Profile and select Tpu.

Supported metrics

| New OpsRamp Metric | Google Metric | Metric Display Name | Unit | Description |
| --- | --- | --- | --- | --- |
| google_tpu_container_cpu_utilization | tpu.googleapis.com/container/cpu/utilization | Container CPU utilization | % | Current CPU utilization of the docker container on the TPU worker. Sampled every 60 seconds. After sampling, data is not visible for up to 60 seconds. |
| google_tpu_container_memory_usage | tpu.googleapis.com/container/memory/usage | Container memory usage | bytes | Current memory usage of the docker container on the TPU worker. Sampled every 60 seconds. After sampling, data is not visible for up to 60 seconds. |
| google_tpu_cpu_utilization | tpu.googleapis.com/cpu/utilization | CPU utilization | % | Current CPU utilization on the TPU worker, represented as a percentage. Values are typically numbers between 0.0 and 100.0, but might exceed 100.0. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
| google_tpu_memory_usage | tpu.googleapis.com/memory/usage | Memory usage | bytes | Memory usage in bytes. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
| google_tpu_network_received_bytes_count | tpu.googleapis.com/network/received_bytes_count | Network bytes received | bytes | Cumulative bytes of data this server has received over the network. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
| google_tpu_network_sent_bytes_count | tpu.googleapis.com/network/sent_bytes_count | Network bytes sent | bytes | Cumulative bytes of data this server has sent over the network. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
| google_tpu_tpu_mxu_utilization | tpu.googleapis.com/tpu/mxu/utilization | MXU utilization | % | Current MXU utilization on the TPU worker, represented as a percentage. Values are typically numbers between 0.0 and 100.0. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
| google_tpu_tpu_tensorcore_idle_duration | tpu.googleapis.com/tpu/tensorcore/idle_duration | Tensorcore idle duration | seconds | The number of seconds tensorcore has been idle for. Sampled every 60 seconds. After sampling, data is not visible for up to 180 seconds. |
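
The OpsRamp metric names listed above appear to follow a simple pattern: the tpu.googleapis.com/ prefix becomes google_tpu_ and each remaining slash becomes an underscore. The following is a minimal Python sketch of that pattern, useful when cross-referencing a Google metric type against its OpsRamp name. The rule is inferred from the rows of this table only, and to_opsramp_name is a hypothetical helper, not part of any OpsRamp or Google API.

```python
# Assumption: the prefix and separator rules are inferred from the metric
# names listed in this document, not taken from an OpsRamp or Google API.
GOOGLE_PREFIX = "tpu.googleapis.com/"
OPSRAMP_PREFIX = "google_tpu_"


def to_opsramp_name(google_metric: str) -> str:
    """Convert a Google TPU metric type to its OpsRamp-style name (hypothetical helper)."""
    if not google_metric.startswith(GOOGLE_PREFIX):
        raise ValueError(f"not a TPU metric type: {google_metric!r}")
    # Drop the service prefix, then flatten the remaining path segments.
    path = google_metric[len(GOOGLE_PREFIX):]
    return OPSRAMP_PREFIX + path.replace("/", "_")


print(to_opsramp_name("tpu.googleapis.com/tpu/mxu/utilization"))
# prints: google_tpu_tpu_mxu_utilization
```

A metric type that does not start with the TPU prefix raises ValueError, so names from other Google services are not silently converted.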

Event support

  • Not Supported

External reference