What's New, ModelMesh? Model Serving at Scale - Rafael Vasquez, IBM
Kubernetes is a natural choice for deploying AI models. However, concerns about model management often arise, along with the need to maximize cluster resources while minimizing cost. Typical approaches leverage Istio and Knative through a single-model-per-container paradigm, but it is easy to deplete cluster resources and hit pod or IP address limits when serving many models at scale.

Fortunately, there is a way to tackle these obstacles: ModelMesh, the multi-model serving backend for KServe. With a small control-plane footprint, this open source solution can host a myriad of models, employing a distributed LRU cache to intelligently load and unload models to and from memory based on current usage. ModelMesh also provides routing capabilities that balance inference requests across copies of a model. It recently delivered a new major release (v0.10) as it continues its integration as KServe's multi-model serving backend.

In this talk, you can expect to learn how AI models can be deployed on Kubernetes in a scalable way for high-performance, high-density model serving, and to see growing capabilities such as newly supported model runtimes like TorchServe and runtime sharing across namespaces.
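As a quick illustration of the deployment flow the talk describes, here is a minimal sketch of creating a KServe InferenceService routed to the ModelMesh backend, using the Kubernetes Python client. This is not material from the talk itself: the namespace, model name, storage key, and model path are hypothetical placeholders, and it assumes a cluster where ModelMesh serving is already installed and the target namespace is enabled for it.

```python
# Minimal sketch: deploying a model through ModelMesh via KServe's
# InferenceService custom resource. Names below are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {
        "name": "example-sklearn-model",      # hypothetical model name
        "namespace": "modelmesh-serving",     # hypothetical, ModelMesh-enabled namespace
        "annotations": {
            # Route this InferenceService to the ModelMesh backend rather
            # than KServe's default single-model-per-pod deployment mode.
            "serving.kserve.io/deploymentMode": "ModelMesh",
        },
    },
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storage": {
                    "key": "localMinIO",                 # hypothetical storage secret key
                    "path": "sklearn/mnist-svm.joblib",  # hypothetical model path
                },
            }
        }
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="modelmesh-serving",
    plural="inferenceservices",
    body=inference_service,
)
```

Under this model, many InferenceServices can share a small, fixed pool of serving-runtime pods: ModelMesh decides where each model is loaded and, per the LRU behavior described above, keeps only actively used models in memory rather than dedicating a pod and IP address to every model.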