This version is still in development and is not considered stable yet. For the latest snapshot version, please use Spring AI 1.0.0-SNAPSHOT!

Ollama Embeddings

With Ollama you can run various AI Models locally and generate embeddings from them. An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

The OllamaEmbeddingModel implementation lerverages the Ollama Embeddings API endpoint.

Prerequisites

You first need to run Ollama on your local machine. Refer to the official Ollama project README to get started running models on your local machine.

Auto-configuration

Spring AI provides Spring Boot auto-configuration for the Azure Ollama Embedding Model. To enable it add the following dependency to your Maven pom.xml file:

<dependency>
   <groupId>org.springframework.ai</groupId>
   <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

or to your Gradle build.gradle build file.

dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama-spring-boot-starter'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file. Spring AI artifacts are published in Spring Milestone and Snapshot repositories. Refer to the Repositories section to add these repositories to your build system.

Embedding Properties

The prefix spring.ai.ollama is the property prefix to configure the connection to Ollama

Property Description Default

spring.ai.ollama.base-url

Base URL where Ollama API server is running.

localhost:11434

The prefix spring.ai.ollama.embedding.options is the property prefix that configures the Ollama embedding model . It includes the Ollama request (advanced) parameters such as the model, keep-alive, and truncate as well as the Ollama model options properties.

Here are the advanced request parameter for the Ollama embedding model:

Property

Description

Default

spring.ai.ollama.embedding.enabled

Enables the Ollama embedding model auto-configuration.

true

spring.ai.ollama.embedding.options.model

The name of the supported model to use. You can use dedicated Embedding Model types

mistral

spring.ai.ollama.embedding.options.keep_alive

Controls how long the model will stay loaded into memory following the request

5m

spring.ai.ollama.embedding.options.truncate

Truncates the end of each input to fit within context length. Returns error if false and context length is exceeded.

true

The remaining options properties are based on the Ollama Valid Parameters and Values and Ollama Types. The default values are based on: Ollama type defaults.

Property

Description

Default

spring.ai.ollama.embedding.options.numa

Whether to use NUMA.

false

spring.ai.ollama.embedding.options.num-ctx

Sets the size of the context window used to generate the next token.

2048

spring.ai.ollama.embedding.options.num-batch

Prompt processing maximum batch size.

512

spring.ai.ollama.embedding.options.num-gpu

The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. 1 here indicates that NumGPU should be set dynamically

-1

spring.ai.ollama.embedding.options.main-gpu

When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results.

0

spring.ai.ollama.embedding.options.low-vram

-

false

spring.ai.ollama.embedding.options.f16-kv

-

true

spring.ai.ollama.embedding.options.logits-all

Return logits for all the tokens, not just the last one. To enable completions to return logprobs, this must be true.

-

spring.ai.ollama.embedding.options.vocab-only

Load only the vocabulary, not the weights.

-

spring.ai.ollama.embedding.options.use-mmap

By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you’re not using mlock. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.

null

spring.ai.ollama.embedding.options.use-mlock

Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM.

false

spring.ai.ollama.embedding.options.num-thread

Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). 0 = let the runtime decide

0

spring.ai.ollama.embedding.options.num-keep

-

4

spring.ai.ollama.embedding.options.seed

Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt.

-1

spring.ai.ollama.embedding.options.num-predict

Maximum number of tokens to predict when generating text. (-1 = infinite generation, -2 = fill context)

-1

spring.ai.ollama.embedding.options.top-k

Reduces the probability of generating nonsense. A higher value (e.g., 100) will give more diverse answers, while a lower value (e.g., 10) will be more conservative.

40

spring.ai.ollama.embedding.options.top-p

Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text.

0.9

spring.ai.ollama.embedding.options.tfs-z

Tail-free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting.

1.0

spring.ai.ollama.embedding.options.typical-p

-

1.0

spring.ai.ollama.embedding.options.repeat-last-n

Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)

64

spring.ai.ollama.embedding.options.temperature

The temperature of the model. Increasing the temperature will make the model answer more creatively.

0.8

spring.ai.ollama.embedding.options.repeat-penalty

Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient.

1.1

spring.ai.ollama.embedding.options.presence-penalty

-

0.0

spring.ai.ollama.embedding.options.frequency-penalty

-

0.0

spring.ai.ollama.embedding.options.mirostat

Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

0

spring.ai.ollama.embedding.options.mirostat-tau

Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text.

5.0

spring.ai.ollama.embedding.options.mirostat-eta

Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive.

0.1

spring.ai.ollama.embedding.options.penalize-newline

-

true

spring.ai.ollama.embedding.options.stop

Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile.

-

spring.ai.ollama.embedding.options.functions

List of functions, identified by their names, to enable for function calling in a single prompt requests. Functions with those names must exist in the functionCallbacks registry.

-

All properties prefixed with spring.ai.ollama.embedding.options can be overridden at runtime by adding a request specific Runtime Options to the EmbeddingRequest call.

Runtime Options

The OllamaOptions.java provides the Ollama configurations, such as the model to use, the low level GPU and CPU tuning, etc.

The default options can be configured using the spring.ai.ollama.embedding.options properties as well.

At start-time use the OllamaEmbeddingModel(OllamaApi ollamaApi, OllamaOptions defaultOptions) to configure the default options used for all embedding requests. At run-time you can override the default options, using a OllamaOptions instance as part of your EmbeddingRequest.

For example to override the default model name for a specific request:

EmbeddingResponse embeddingResponse = embeddingModel.call(
    new EmbeddingRequest(List.of("Hello World", "World is big and salvation is near"),
        OllamaOptions.builder()
            .withModel("Different-Embedding-Model-Deployment-Name"))
            .withtTuncates(false)
            .build());

Sample Controller

This will create a EmbeddingModel implementation that you can inject into your class. Here is an example of a simple @Controller class that uses the EmbeddingModel implementation.

@RestController
public class EmbeddingController {

    private final EmbeddingModel embeddingModel;

    @Autowired
    public EmbeddingController(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @GetMapping("/ai/embedding")
    public Map embed(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
        EmbeddingResponse embeddingResponse = this.embeddingModel.embedForResponse(List.of(message));
        return Map.of("embedding", embeddingResponse);
    }
}

Manual Configuration

If you are not using Spring Boot, you can manually configure the OllamaEmbeddingModel. For this add the spring-ai-ollama dependency to your project’s Maven pom.xml file:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama</artifactId>
</dependency>

or to your Gradle build.gradle build file.

dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file.
The spring-ai-ollama dependency provides access also to the OllamaChatModel. For more information about the OllamaChatModel refer to the Ollama Chat Client section.

Next, create an OllamaEmbeddingModel instance and use it to compute the embeddings for two input texts using a dedicated chroma/all-minilm-l6-v2-f32 embedding models:

var ollamaApi = new OllamaApi();

var embeddingModel = new OllamaEmbeddingModel(ollamaApi,
        OllamaOptions.builder()
			.withModel(OllamaModel.MISTRAL.id())
            .build()
            .toMap());

EmbeddingResponse embeddingResponse = embeddingModel.call(
    new EmbeddingRequest(List.of("Hello World", "World is big and salvation is near"),
        OllamaOptions.builder()
            .withModel("chroma/all-minilm-l6-v2-f32"))
            .withtTruncate(false)
            .build());

The OllamaOptions provides the configuration information for all embedding requests.