Cache

The SimpliML cache returns stored responses for queries it has already answered, so the request does not have to be sent to your model endpoint again.

Matches can be exact strings or semantically similar prompts. In either case, a cache hit can be served up to 20 times faster than a model call, and at lower cost.

Cache Type

Semantic Cache: The semantic cache compares the incoming prompt with previously cached prompts using cosine similarity. If the similarity exceeds the configured threshold, SimpliML returns the cached response instead of running the model. For more detail, see our blog.
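
The threshold check described above can be illustrated with a minimal sketch. This is not SimpliML's implementation: the function names, the list-of-pairs cache layout, and the toy two-dimensional embeddings are assumptions made for illustration, and a real system would obtain embeddings from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_lookup(query_embedding, cache_entries, threshold=0.90):
    """Return the response of the most similar cached entry, or None.

    cache_entries is a hypothetical list of (embedding, response) pairs.
    A hit requires the best similarity to exceed the threshold.
    """
    best_score, best_response = 0.0, None
    for embedding, response in cache_entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score > threshold else None


# Toy data: two cached prompts with made-up embeddings.
entries = [([1.0, 0.0], "cached answer A"), ([0.0, 1.0], "cached answer B")]
hit = semantic_lookup([0.99, 0.05], entries)   # close to the first entry
miss = semantic_lookup([1.0, 1.0], entries)    # similarity about 0.71 to both
```

With the default threshold of 0.90, the first query is served from the cache and the second falls through to the model.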

Simple Cache: The simple cache requires an exact match on the input prompt. If an identical request is received, SimpliML returns the response directly from the cache, bypassing model execution. This approach works well for workloads that repeat identical requests.
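
As a rough illustration of exact-match caching, the sketch below keys responses by a hash of the prompt text. The `SimpleCache` class and its methods are hypothetical and only show the idea; they do not describe how SimpliML stores entries.

```python
import hashlib


class SimpleCache:
    """Hypothetical exact-match cache keyed by a SHA-256 hash of the prompt."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt):
        # Identical strings always hash to the same key; any difference
        # (case, whitespace, punctuation) produces a different key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        """Return the cached response, or None on a miss."""
        return self._store.get(self._key(prompt))

    def put(self, prompt, response):
        self._store[self._key(prompt)] = response


cache = SimpleCache()
cache.put("What is the capital of France?", "Paris")
exact = cache.get("What is the capital of France?")      # hit
different = cache.get("what is the capital of france?")  # miss: case differs
```

Because the match is exact, even a change in letter case results in a miss, which is the trade-off against the semantic cache.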

Hybrid Cache: The hybrid cache combines the semantic cache with keyword search to rank and retrieve cached responses, so retrieval reflects both contextual similarity and exact keyword matches.
(Hybrid Cache is currently a work in progress and cannot be used yet.)

Cache Mode

All Model: The response can be retrieved from the cache of any model used on the platform.

Same Model: The response is retrieved only from the cache of the model specified in the inference request.
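
The difference between the two modes can be sketched as a filter on which cached entries are eligible. The entry layout (`model`, `prompt`, `response` keys) and the `lookup` function are assumptions for illustration, and exact prompt matching is used only to keep the example short.

```python
def lookup(entries, prompt, model_id, mode="all_model"):
    """Return a cached response for prompt, honoring the cache mode.

    entries is a hypothetical list of dicts with "model", "prompt",
    and "response" keys.
    """
    if mode not in ("all_model", "same_model"):
        raise ValueError(f"unknown cache mode: {mode!r}")
    for entry in entries:
        if entry["prompt"] != prompt:
            continue
        # same_model only accepts entries written by the requested model.
        if mode == "same_model" and entry["model"] != model_id:
            continue
        return entry["response"]
    return None


entries = [{"model": "model-a", "prompt": "Hello", "response": "Hi from A"}]
shared = lookup(entries, "Hello", "model-b", mode="all_model")   # hit
scoped = lookup(entries, "Hello", "model-b", mode="same_model")  # miss
```

Here a request addressed to `model-b` can reuse the answer cached by `model-a` only in `all_model` mode.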

To enable the cache, include a config object in your Inference API request.

Here is a quick example of a config that enables the cache:

config object

    {
        "cache": {
            "enable": true,
            "type": "semantic",
            "mode": "all_model",
            "threshold": 0.90 
        }
    }

Option	Type	Description	Default	Required
enable	boolean	Whether to use the cache for this request	-	Yes
type	string	Caching technique to use: simple, semantic, or hybrid	semantic	No
mode	string	Cache mode to use: all_model or same_model	all_model	No
threshold	float	For the semantic and hybrid types, the similarity threshold. A cached response is returned only when the similarity between the incoming prompt and a cached prompt exceeds this value	0.90	No
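
The options and defaults in the table can be assembled programmatically. The helper below is a client-side sketch, not part of any SimpliML SDK: `build_cache_config` is a hypothetical name, and omitting `threshold` for the simple type is a choice based on the table saying the threshold applies to the semantic and hybrid types.

```python
import json

VALID_TYPES = {"simple", "semantic", "hybrid"}
VALID_MODES = {"all_model", "same_model"}


def build_cache_config(enable, cache_type="semantic", mode="all_model", threshold=0.90):
    """Build the config object using the defaults listed in the options table."""
    if not isinstance(enable, bool):
        raise TypeError("enable must be a boolean")
    if cache_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

    cache = {"enable": enable, "type": cache_type, "mode": mode}
    # The threshold is documented for the semantic and hybrid types only,
    # so this sketch leaves it out for the simple type.
    if cache_type in ("semantic", "hybrid"):
        cache["threshold"] = float(threshold)
    return {"cache": cache}


config = build_cache_config(True, cache_type="semantic", mode="same_model", threshold=0.95)
print(json.dumps(config, indent=4))
```

The resulting dictionary has the same shape as the config object shown earlier and can be serialized into the request body.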
