AWS is making it easier to track AI use for improved cost transparency – with Bedrock customers now able to break down inference costs by department, team, or application using AWS cost allocation tags. (Many organisations have been adopting third-party gateways for this.)
Users of Bedrock, a managed service that offers various LLMs via one API, can do this by creating an application inference profile and adding tags. AWS also added cross-region inference – so that customers can manage unplanned traffic bursts by using compute across different AWS Regions.
AWS said on November 7: “There's no additional routing cost for using cross-region inference. The price is calculated based on the region from which you call an inference profile… Cross-region inference requests are kept within the regions that are part of the inference profile [for example] a request made with an EU inference profile is kept within EU regions.”
Do read the small print, as ever.
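The workflow itself is simple: create an application inference profile (optionally copying one of the system-defined cross-region profiles), tag it by department, team or application, then pass the profile's ARN as the model ID on inference calls. A rough boto3 sketch follows; the profile name, tag keys, account ID and ARNs are placeholders, and the tags still need to be activated as cost allocation tags in the Billing console before the spend shows up broken out in Cost Explorer.

```python
# Rough sketch only: the names, tag keys, account ID and ARNs below are illustrative.
import boto3

bedrock = boto3.client("bedrock", region_name="eu-west-1")          # control plane
runtime = boto3.client("bedrock-runtime", region_name="eu-west-1")  # inference

# Create an application inference profile that copies a system-defined EU
# cross-region profile, and tag it for cost allocation.
# (list_inference_profiles() will show the exact source ARN in your account.)
profile = bedrock.create_inference_profile(
    inferenceProfileName="payments-team-claude",
    description="Claude for the payments team, kept within EU regions",
    modelSource={
        "copyFrom": "arn:aws:bedrock:eu-west-1:111122223333:inference-profile/"
                    "eu.anthropic.claude-3-5-sonnet-20240620-v1:0"
    },
    tags=[
        {"key": "department", "value": "payments"},
        {"key": "application", "value": "dispute-triage"},
    ],
)
profile_arn = profile["inferenceProfileArn"]

# Invoke the model through the profile; usage is then attributable to the tags.
response = runtime.converse(
    modelId=profile_arn,
    messages=[{"role": "user", "content": [{"text": "Summarise this dispute..."}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```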
The moves this month come as enterprises continue to refine their approach to AI application development and architectures – with spending still growing significantly in this space. (Amazon CEO Andy Jassy said Oct 31 that AWS sees firms using “multiple model types from different model providers and multiple model sizes in the same application.”)
That chimes with growing interest in a federated language model approach, particularly among organisations testing agentic AI: a RAG agent deployed locally, powered by a small language model and augmented by an LLM in the cloud.
How that works, in 7 steps...
Janakiram MSV has a tidy example in The New Stack.
Step 1: The user sends the agent a prompt that needs access to local databases exposed as APIs.
Step 2: Since the SLM running at the edge cannot map the prompt to functions and arguments, the agent — which acts as the orchestrator and the glue — sends the prompt, along with the available tools, to the LLM running in the cloud.
Step 3: The capable LLM responds with a set of tools — functions and arguments — to the orchestrator. The only job of this model, running in the public cloud, is to break the prompt down into a list of functions.
Step 4: The agent enumerates the tools identified by the LLM and executes them in parallel. This essentially involves invoking the API that interfaces with the local databases and data sources, which are sensitive and confidential. The agent aggregates the response from the invoked functions and constructs the context.
Step 5: The agent then sends the original query submitted by the user, along with the aggregated context collected from the tools, to the SLM running at the edge.
Step 6: The SLM responds with factual information derived from the context sent by the orchestrator/agent. This is obviously limited to the context length of the SLM.
Step 7: The agent returns the final response to the user's original query.
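In code, that loop might look something like the sketch below, assuming an OpenAI-compatible API on both sides: a hosted model for tool planning, and a small local model (served via something like Ollama) for the grounded answer. The tool functions, model names and endpoints are illustrative placeholders rather than anything from Janakiram's piece.

```python
# Sketch of the federated flow: a cloud LLM plans the tool calls, the agent runs
# them against local APIs, and a local SLM writes the final answer from that context.
import json
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

cloud_llm = OpenAI()                                                         # hosted LLM (tool planning)
local_slm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")   # SLM at the edge

# Local, sensitive data sources exposed to the agent as plain functions (stubs here).
def lookup_customer(customer_id: str) -> dict:
    return {"customer_id": customer_id, "tier": "gold", "open_tickets": 2}

def lookup_orders(customer_id: str) -> dict:
    return {"customer_id": customer_id, "orders": ["A-1001", "A-1042"]}

LOCAL_TOOLS = {"lookup_customer": lookup_customer, "lookup_orders": lookup_orders}

TOOL_SPECS = [
    {"type": "function", "function": {
        "name": name,
        "description": f"Query the local {name} API",
        "parameters": {"type": "object",
                       "properties": {"customer_id": {"type": "string"}},
                       "required": ["customer_id"]},
    }} for name in LOCAL_TOOLS
]

def answer(user_prompt: str) -> str:
    # Steps 1-3: send the prompt plus the tool catalogue to the cloud LLM, whose
    # only job is to map the prompt to function names and arguments.
    plan = cloud_llm.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": user_prompt}],
        tools=TOOL_SPECS,
    )
    tool_calls = plan.choices[0].message.tool_calls or []

    # Step 4: execute the selected tools in parallel against the local APIs and
    # aggregate the results into a context block. Sensitive records stay local.
    def run(call):
        fn = LOCAL_TOOLS[call.function.name]
        return fn(**json.loads(call.function.arguments))

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run, tool_calls))
    context = json.dumps(results, indent=2)

    # Steps 5-7: the local SLM answers the original question from that context,
    # bounded by its own context window.
    final = local_slm.chat.completions.create(
        model="llama3.2:3b",  # illustrative small model
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"},
        ],
    )
    return final.choices[0].message.content

print(answer("What is the status of customer 42 and their recent orders?"))
```

The division of labour is the point: the cloud LLM only ever sees the prompt and the tool schemas, while the sensitive records returned by the local APIs, and the final answer, never leave the local environment.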
Looking ahead to 2025, Gartner’s John-David Lovelock noted this month that “CIOs in Europe will continue investing in public cloud end-user spending, which is estimated to reach $123 billion in 2024”. (Gartner thinks spending on IT services for AI will hit $78 billion in 2024 in Europe.)
With regard to AWS’s inference cost allocation tags, some organisations are looking to avoid vendor lock-in by deploying open-source gateways for this kind of control. For example, as The Stack reported last month, a platform engineering team at Bloomberg has teamed up with Tetrate, a cloud-native application networking specialist, to start building a fully open-source “AI Gateway” that can sit in front of multiple LLMs and handle authentication, rate limiting and other features for enterprise teams. They are building it around the open source Envoy Gateway.