Ollama Embeddings
With Ollama you can run various Large Language Models (LLMs) locally and generate embeddings from them.
Spring AI supports the Ollama text embeddings with OllamaEmbeddingClient
.
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
Prerequisites
You first need to run Ollama on your local machine.
Refer to the official Ollama project README to get started running models on your local machine.
Note, installing ollama run llama2
will download a 4GB docker image.
Add Repositories and BOM
Spring AI artifacts are published in Spring Milestone and Snapshot repositories. Refer to the Repositories section to add these repositories to your build system.
To help with dependency management, Spring AI provides a BOM (bill of materials) to ensure that a consistent version of Spring AI is used throughout the entire project. Refer to the Dependency Management section to add the Spring AI BOM to your build system.
Auto-configuration
Spring AI provides Spring Boot auto-configuration for the Azure Ollama Embedding Client.
To enable it add the following dependency to your Maven pom.xml
file:
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>
or to your Gradle build.gradle
build file.
dependencies {
implementation 'org.springframework.ai:spring-ai-ollama-spring-boot-starter'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file. |
The spring.ai.ollama.embedding.options.*
properties are used to configure the default options used for all embedding requests.
(It is used as OllamaEmbeddingClient#withDefaultOptions()
instance).
Embedding Properties
The prefix spring.ai.ollama
is the property prefix to configure the connection to Ollama
Property | Description | Default |
---|---|---|
spring.ai.ollama.base-url |
Base URL where Ollama API server is running. |
The prefix spring.ai.ollama.embedding.options
is the property prefix that configures the EmbeddingClient
implementation for Ollama.
Property | Description | Default |
---|---|---|
spring.ai.ollama.embedding.enabled |
Enable Ollama embedding client. |
true |
spring.ai.ollama.embedding.model (DEPRECATED) |
The name of the model to use. Deprecated use the |
mistral |
spring.ai.ollama.embedding.options.model |
The name of the supported models to use. |
mistral |
spring.ai.ollama.embedding.options.numa |
Whether to use NUMA. |
false |
spring.ai.ollama.embedding.options.num-ctx |
Sets the size of the context window used to generate the next token. |
2048 |
spring.ai.ollama.embedding.options.num-batch |
??? |
- |
spring.ai.ollama.embedding.options.num-gqa |
The number of GQA groups in the transformer layer. Required for some models, for example, it is 8 for llama2:70b. |
- |
spring.ai.ollama.embedding.options.num-gpu |
The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. |
- |
spring.ai.ollama.embedding.options.main-gpu |
??? |
- |
spring.ai.ollama.embedding.options.low-vram |
??? |
- |
spring.ai.ollama.embedding.options.f16-kv |
??? |
- |
spring.ai.ollama.embedding.options.logits-all |
??? |
- |
spring.ai.ollama.embedding.options.vocab-only |
??? |
- |
spring.ai.ollama.embedding.options.use-mmap |
??? |
- |
spring.ai.ollama.embedding.options.use-mlock |
??? |
- |
spring.ai.ollama.embedding.options.embedding-only |
??? |
- |
spring.ai.ollama.embedding.options.rope-frequency-base |
??? |
- |
spring.ai.ollama.embedding.options.rope-frequency-scale |
??? |
- |
spring.ai.ollama.embedding.options.num-thread |
Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). |
- |
spring.ai.ollama.embedding.options.num-keep |
??? |
- |
spring.ai.ollama.embedding.options.seed |
Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. |
0 |
spring.ai.ollama.embedding.options.num-predict |
Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) |
128 |
spring.ai.ollama.embedding.options.top-k |
Reduces the probability of generating nonsense. A higher value (e.g., 100) will give more diverse answers, while a lower value (e.g., 10) will be more conservative. |
40 |
spring.ai.ollama.embedding.options.top-p |
Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. |
0.9 |
spring.ai.ollama.embedding.options.tfs-z |
Tail-free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. |
1 |
spring.ai.ollama.embedding.options.typical-p |
??? |
- |
spring.ai.ollama.embedding.options.repeat-last-n |
Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) |
64 |
spring.ai.ollama.embedding.options.temperature |
The temperature of the model. Increasing the temperature will make the model answer more creatively. |
0.8 |
spring.ai.ollama.embedding.options.repeat-penalty |
Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. |
1.1 |
spring.ai.ollama.embedding.options.presence-penalty |
??? |
- |
spring.ai.ollama.embedding.options.frequency-penalty |
??? |
- |
spring.ai.ollama.embedding.options.mirostat |
Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
0 |
spring.ai.ollama.embedding.options.mirostat-tau |
Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. |
0.1 |
spring.ai.ollama.embedding.options.mirostat-eta |
Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. |
5.0 |
spring.ai.ollama.embedding.options.penalize-newline |
??? |
- |
spring.ai.ollama.embedding.options.stop |
Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile. |
- |
The spring.ai.ollama.embedding.options.* properties are based on the Ollama Valid Parameters and Values and Ollama Types
|
All properties prefixed with spring.ai.ollama.embedding.options can be overridden at runtime by adding a request specific Embedding Options to the EmbeddingRequest call.
|
Embedding Options
The OllamaOptions.java provides the Ollama configurations, such as the model to use, the low level GPU and CPU tunning, etc.
The default options can be configured using the spring.ai.ollama.embedding.options
properties as well.
At start-time use the OllamaEmbeddingClient#withDefaultOptions()
to configure the default options used for all embedding requests.
At run-time you can override the default options, using a OllamaOptions
instance as part of your EmbeddingRequest
.
For example to override the default model name for a specific request:
EmbeddingResponse embeddingResponse = embeddingClient.call(
new EmbeddingRequest(List.of("Hello World", "World is big and salvation is near"),
OllamaOptions.create()
.withModel("Different-Embedding-Model-Deployment-Name"));
Sample Controller (Auto-configuration)
This will create a EmbeddingClient
implementation that you can inject into your class.
Here is an example of a simple @Controller
class that uses the EmbeddingClient
implementation.
@RestController
public class EmbeddingController {
private final EmbeddingClient embeddingClient;
@Autowired
public EmbeddingController(EmbeddingClient embeddingClient) {
this.embeddingClient = embeddingClient;
}
@GetMapping("/ai/embedding")
public Map embed(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
EmbeddingResponse embeddingResponse = this.embeddingClient.embedForResponse(List.of(message));
return Map.of("embedding", embeddingResponse);
}
}
Manual Configuration
If you are not using Spring Boot, you can manually configure the OllamaEmbeddingClient
.
For this add the spring-ai-ollama dependency to your project’s Maven pom.xml file:
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-ollama</artifactId>
</dependency>
or to your Gradle build.gradle
build file.
dependencies {
implementation 'org.springframework.ai:spring-ai-ollama'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file. |
The spring-ai-ollama dependency provides access also to the OllamaChatClient .
For more information about the OllamaChatClient refer to the Ollama Chat Client section.
|
Next, create an OllamaEmbeddingClient
instance and use it to compute the similarity between two input texts:
var ollamaApi = new OllamaApi();
var embeddingClient = new OllamaEmbeddingClient(ollamaApi)
.withDefaultOptions(OllamaOptions.create()
.withModel(OllamaOptions.DEFAULT_MODEL)
.toMap());
EmbeddingResponse embeddingResponse = embeddingClient
.embedForResponse(List.of("Hello World", "World is big and salvation is near"));
The OllamaOptions
provides the configuration information for all embedding requests.