TIL ·

I just found out about the KServe v2 inference protocol. Late to the party I guess. It looks like ML/AI serving APIs are being standardized. I remember when Justin Yan introduced the original model training and serving containers at Remitly. It certainly would have been nice to have standards like these back then. It looks like Triton Inference Server was one of the first to support this API.
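For a sense of what the standardization looks like: the v2 protocol (also called the Open Inference Protocol) frames every request as named tensors posted to `/v2/models/<name>/infer`. A minimal sketch of building such a payload, where the tensor name `input-0` and the shape are illustrative, not required by the spec:

```python
import json

def build_v2_request(name: str, shape: list[int], datatype: str, data: list) -> str:
    """Build a KServe v2 / Open Inference Protocol request body.

    Any v2-compliant server (Triton, KServe, ...) accepts this shape at
    POST /v2/models/<model_name>/infer. The tensor name is model-specific.
    """
    body = {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }
    return json.dumps(body)

# Hypothetical single-input model taking a 1x3 float tensor.
payload = build_v2_request("input-0", [1, 3], "FP32", [1.0, 2.0, 3.0])
```

The appeal of the standard is exactly that this payload is server-agnostic: the same body works against any backend that speaks v2.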

KServe even has a k8s CRD that spins up inference servers specifically for LLM inference and supports the OpenAI completion API.
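As a rough sketch of what that looks like with KServe's `InferenceService` CRD and its Hugging Face serving runtime (the service name and model id below are placeholders, and the exact fields may differ by KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo            # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface   # KServe's LLM-oriented runtime
      args:
        - --model_name=llm-demo
        - --model_id=org/model   # placeholder Hugging Face model id
```

Once the predictor is up, the runtime exposes OpenAI-style completion endpoints rather than (or alongside) the raw v2 tensor API.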

I do wonder whether this lowest common denominator will work, though. OpenAI now has the Responses API and Gemini has the Interactions API. Both are stateful, and they remind me of the history concept from LangChain, which feels more natural for the multi-turn nature of chatbots and agents.

And, of course, there are the Realtime APIs and Live APIs.

I feel like I’m 15 again (browser wars) and everyone is inventing almost the same stuff, but just different enough.