Ollama Chat
With Ollama you can run various Large Language Models (LLMs) locally and generate text from them.
Spring AI supports the Ollama chat completion capabilities with the OllamaChatModel API.
Ollama offers an OpenAI API compatible endpoint as well. The OpenAI API Compatibility section explains how to use the Spring AI OpenAI client to connect to an Ollama server.
Prerequisites
You first need access to an Ollama instance. There are a few options, including the following:
- Download and install Ollama on your local machine.
- Configure and run Ollama via Testcontainers.
- Bind to an Ollama instance via Kubernetes Service Bindings.
You can pull the models you want to use in your application from the Ollama model library:
ollama pull <model-name>
You can also pull any of the thousands of free GGUF models hosted on Hugging Face:
ollama pull hf.co/<username>/<model-repository>
Alternatively, you can enable automatic downloading of any needed model; see Auto-pulling Models.
Auto-configuration
Spring AI provides Spring Boot auto-configuration for the Ollama chat integration.
To enable it, add the following dependency to your project’s Maven pom.xml or Gradle build.gradle build file:
Maven:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>
Gradle:
dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama-spring-boot-starter'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file.
Base Properties
The prefix spring.ai.ollama is the property prefix to configure the connection to Ollama.
Property | Description | Default
spring.ai.ollama.base-url | Base URL where the Ollama API server is running. |
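Before wiring the base URL into Spring, it can help to confirm that an Ollama server actually answers there. The following is a minimal sketch using only the JDK HTTP client; it assumes Ollama's model-listing REST endpoint /api/tags and the http://localhost:11434 address used in this page's sample configuration. The class and method names are hypothetical and not part of Spring AI.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper (not a Spring AI class) for probing an Ollama base URL.
final class OllamaConnectionCheck {

    // Builds a GET request for Ollama's model-listing endpoint (/api/tags).
    static HttpRequest listModelsRequest(String baseUrl) {
        // Strip a trailing slash so the path never becomes "//api/tags".
        String normalized = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return HttpRequest.newBuilder(URI.create(normalized + "/api/tags"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    // Returns true when the server answers with HTTP 200, false on any I/O failure.
    static boolean isReachable(String baseUrl) {
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(listModelsRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling OllamaConnectionCheck.isReachable("http://localhost:11434") from a startup check or a test gives a quick signal that the value intended for spring.ai.ollama.base-url is usable.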
Here are the properties for initializing the Ollama integration and auto-pulling models.
Property | Description | Default
spring.ai.ollama.init.pull-model-strategy | Whether to pull models at startup-time and how. |
spring.ai.ollama.init.timeout | How long to wait for a model to be pulled. |
spring.ai.ollama.init.max-retries | Maximum number of retries for the model pull operation. |
spring.ai.ollama.init.chat.include | Include this type of models in the initialization task. |
spring.ai.ollama.init.chat.additional-models | Additional models to initialize besides the ones configured via default properties. |
Chat Properties
The prefix spring.ai.ollama.chat.options is the property prefix that configures the Ollama chat model.
It includes the Ollama request (advanced) parameters such as the model, keep-alive, and format as well as the Ollama model options properties.
Here are the advanced request parameters for the Ollama chat model:
Property | Description | Default
spring.ai.ollama.chat.enabled | Enable Ollama chat model. | true
spring.ai.ollama.chat.options.model | The name of the supported model to use. | mistral
spring.ai.ollama.chat.options.format | The format to return a response in. Currently, the only accepted value is json. | -
spring.ai.ollama.chat.options.keep_alive | Controls how long the model will stay loaded into memory following the request. | 5m
The remaining options properties are based on the Ollama Valid Parameters and Values and Ollama Types. The default values are based on the Ollama Types Defaults.
Property | Description | Default
spring.ai.ollama.chat.options.numa | Whether to use NUMA. | false
spring.ai.ollama.chat.options.num-ctx | Sets the size of the context window used to generate the next token. | 2048
spring.ai.ollama.chat.options.num-batch | Prompt processing maximum batch size. | 512
spring.ai.ollama.chat.options.num-gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. -1 here indicates that NumGPU should be set dynamically. | -1
spring.ai.ollama.chat.options.main-gpu | When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. | 0
spring.ai.ollama.chat.options.low-vram | - | false
spring.ai.ollama.chat.options.f16-kv | - | true
spring.ai.ollama.chat.options.logits-all | Return logits for all the tokens, not just the last one. To enable completions to return logprobs, this must be true. | -
spring.ai.ollama.chat.options.vocab-only | Load only the vocabulary, not the weights. | -
spring.ai.ollama.chat.options.use-mmap | By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you’re not using mlock. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all. | null
spring.ai.ollama.chat.options.use-mlock | Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM. | false
spring.ai.ollama.chat.options.num-thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). 0 = let the runtime decide. | 0
spring.ai.ollama.chat.options.num-keep | - | 4
spring.ai.ollama.chat.options.seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. | -1
spring.ai.ollama.chat.options.num-predict | Maximum number of tokens to predict when generating text. (-1 = infinite generation, -2 = fill context) | -1
spring.ai.ollama.chat.options.top-k | Reduces the probability of generating nonsense. A higher value (e.g., 100) will give more diverse answers, while a lower value (e.g., 10) will be more conservative. | 40
spring.ai.ollama.chat.options.top-p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. | 0.9
spring.ai.ollama.chat.options.tfs-z | Tail-free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. | 1.0
spring.ai.ollama.chat.options.typical-p | - | 1.0
spring.ai.ollama.chat.options.repeat-last-n | Sets how far back the model looks to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | 64
spring.ai.ollama.chat.options.temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. | 0.8
spring.ai.ollama.chat.options.repeat-penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. | 1.1
spring.ai.ollama.chat.options.presence-penalty | - | 0.0
spring.ai.ollama.chat.options.frequency-penalty | - | 0.0
spring.ai.ollama.chat.options.mirostat | Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | 0
spring.ai.ollama.chat.options.mirostat-tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. | 5.0
spring.ai.ollama.chat.options.mirostat-eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. | 0.1
spring.ai.ollama.chat.options.penalize-newline | - | true
spring.ai.ollama.chat.options.stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile. | -
spring.ai.ollama.chat.options.functions | List of functions, identified by their names, to enable for function calling in a single prompt request. Functions with those names must exist in the functionCallbacks registry. | -
spring.ai.ollama.chat.options.proxy-tool-calls | If true, Spring AI will not handle the function calls internally, but will proxy them to the client. It is then the client’s responsibility to handle the function calls, dispatch them to the appropriate function, and return the results. If false (the default), Spring AI will handle the function calls internally. Applicable only for chat models with function calling support. | false
All properties prefixed with spring.ai.ollama.chat.options can be overridden at runtime by adding request-specific Runtime Options to the Prompt call.
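To build some intuition for the temperature option, the sketch below applies a temperature-scaled softmax to a handful of raw token scores. This is a generic illustration of the technique, not Ollama's or Spring AI's implementation; the class name and the example scores are made up for the demonstration.

```java
// Illustrative only: shows how temperature reshapes a probability distribution.
final class TemperatureSketch {

    // Converts raw scores (logits) into probabilities after dividing by the temperature.
    static double[] softmax(double[] logits, double temperature) {
        if (temperature <= 0.0) {
            throw new IllegalArgumentException("temperature must be positive");
        }
        // Subtract the maximum scaled logit for numerical stability.
        double max = Double.NEGATIVE_INFINITY;
        for (double logit : logits) {
            max = Math.max(max, logit / temperature);
        }
        double[] probabilities = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            probabilities[i] = Math.exp(logits[i] / temperature - max);
            sum += probabilities[i];
        }
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] /= sum;
        }
        return probabilities;
    }
}
```

With the same scores, a lower temperature concentrates more probability on the top-scoring token, which is why lower values give more focused output and higher values give more varied output.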
Runtime Options
The OllamaOptions.java class provides model configurations, such as the model to use, the temperature, etc.
On start-up, the default options can be configured with the OllamaChatModel(api, options) constructor or the spring.ai.ollama.chat.options.* properties.
At run-time, you can override the default options by adding new, request-specific options to the Prompt call.
For example, to override the default model and temperature for a specific request:
ChatResponse response = chatModel.call(
    new Prompt(
        "Generate the names of 5 famous pirates.",
        OllamaOptions.builder()
            .withModel(OllamaModel.LLAMA3_1)
            .withTemperature(0.4)
            .build()
    ));
In addition to the model-specific OllamaOptions, you can use a portable ChatOptions instance, created with ChatOptionsBuilder#builder().
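The override behaviour described in this section can be pictured as a simple merge: values set on the request win, and anything left unset falls back to the start-up defaults. The sketch below models that idea with plain maps. It is a conceptual illustration only; the class and method names are hypothetical, and Spring AI's actual merging logic is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of "request options override defaults"; not Spring AI code.
final class OptionsMergeSketch {

    // Starts from the defaults and overlays every non-null request-specific value.
    static Map<String, Object> merge(Map<String, Object> defaults, Map<String, Object> runtime) {
        Map<String, Object> merged = new HashMap<>(defaults);
        runtime.forEach((key, value) -> {
            if (value != null) {
                merged.put(key, value);
            }
        });
        return merged;
    }
}
```

For instance, merging defaults of model=mistral and temperature=0.7 with a request that only sets temperature=0.4 yields model=mistral and temperature=0.4.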
Auto-pulling Models
Spring AI Ollama can automatically pull models when they are not available in your Ollama instance. This feature is particularly useful for development and testing as well as for deploying your applications to new environments.
You can also pull, by name, any of the thousands of free GGUF models hosted on Hugging Face.
There are three strategies for pulling models:
- always (defined in PullModelStrategy.ALWAYS): Always pull the model, even if it’s already available. Useful to ensure you’re using the latest version of the model.
- when_missing (defined in PullModelStrategy.WHEN_MISSING): Only pull the model if it’s not already available. This may result in using an older version of the model.
- never (defined in PullModelStrategy.NEVER): Never pull the model automatically.
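The three strategies listed above reduce to a small decision on whether the model is already present. The following sketch spells that decision out. The nested Strategy enum is a hypothetical stand-in for Spring AI's PullModelStrategy, used here so the example stays self-contained.

```java
// Conceptual sketch of the pull decision; not the Spring AI implementation.
final class PullDecisionSketch {

    // Hypothetical stand-in for Spring AI's PullModelStrategy enum.
    enum Strategy { ALWAYS, WHEN_MISSING, NEVER }

    // Decides whether a pull should happen, given the strategy and local availability.
    static boolean shouldPull(Strategy strategy, boolean modelAvailable) {
        switch (strategy) {
            case ALWAYS:
                return true;              // refresh even if the model is present
            case WHEN_MISSING:
                return !modelAvailable;   // pull only when the model is absent
            case NEVER:
            default:
                return false;             // the model must be provisioned by other means
        }
    }
}
```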
Due to potential delays while downloading models, automatic pulling is not recommended for production environments. Instead, consider assessing and pre-downloading the necessary models in advance.
All models defined via configuration properties and default options can be automatically pulled at startup time. You can configure the pull strategy, timeout, and maximum number of retries using configuration properties:
spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        timeout: 60s
        max-retries: 1
The application will not complete its initialization until all specified models are available in Ollama. Depending on the model size and internet connection speed, this may significantly slow down your application’s startup time.
You can initialize additional models at startup, which is useful for models used dynamically at runtime:
spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        chat:
          additional-models:
            - llama3.2
            - qwen2.5
If you want to apply the pulling strategy only to specific types of models, you can exclude chat models from the initialization task:
spring:
  ai:
    ollama:
      init:
        pull-model-strategy: always
        chat:
          include: false
This configuration will apply the pulling strategy to all models except chat models.
Function Calling
You can register custom Java functions with the OllamaChatModel and have the Ollama model intelligently choose to output a JSON object containing arguments to call one or many of the registered functions.
This is a powerful technique to connect the LLM capabilities with external tools and APIs.
Read more about Ollama Function Calling.
You need Ollama 0.2.8 or newer to use the function calling capabilities.
Currently, the Ollama API (0.3.8) does not support function calling in streaming mode.
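Conceptually, function calling boils down to a registry of named functions and a dispatch step: the model replies with a function name plus arguments, and the application looks the function up and invokes it. The sketch below models only that idea in plain Java. It does not use the Spring AI function calling API, and the class name, method names, and the currentWeather example are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Conceptual registry-and-dispatch model; not the Spring AI function calling API.
final class FunctionRegistrySketch {

    private final Map<String, Function<Map<String, Object>, String>> functions = new HashMap<>();

    // Registers a function under the name the model will use to request it.
    void register(String name, Function<Map<String, Object>, String> function) {
        functions.put(name, function);
    }

    // Invokes the function requested by the model with the arguments it supplied.
    String dispatch(String name, Map<String, Object> arguments) {
        Function<Map<String, Object>, String> function = functions.get(name);
        if (function == null) {
            throw new IllegalArgumentException("No function registered under name: " + name);
        }
        return function.apply(arguments);
    }
}
```

In a real exchange, the string returned by dispatch would be sent back to the model as the function result so that it can compose the final answer.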
Multimodal
Multimodality refers to a model’s ability to simultaneously understand and process information from various sources, including text, images, audio, and other data formats.
Some of the models available in Ollama with multimodality support are LLaVA and BakLLaVA (see the full list). For further details, refer to LLaVA: Large Language and Vision Assistant.
The Ollama Message API provides an "images" parameter to incorporate a list of base64-encoded images with the message.
Spring AI’s Message interface facilitates multimodal AI models by introducing the Media type.
This type encompasses data and details regarding media attachments in messages, utilizing Spring’s org.springframework.util.MimeType and an org.springframework.core.io.Resource for the raw media data.
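At the wire level, the "images" parameter mentioned above carries each image as base64 text. The sketch below shows that encoding step with the JDK only and assembles a user message fragment by hand. The class and method names are hypothetical, and the JSON is built by plain concatenation on the assumption that the text needs no escaping; a real client would use a JSON library.

```java
import java.util.Base64;

// Illustrative only: shows the base64 step behind the "images" parameter.
final class ImageEncodingSketch {

    // Encodes raw image bytes as the base64 text placed in the "images" list.
    static String toBase64(byte[] imageBytes) {
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    // Builds a user message fragment; assumes content contains no characters needing JSON escaping.
    static String userMessageJson(String content, byte[] imageBytes) {
        return "{\"role\":\"user\",\"content\":\"" + content + "\",\"images\":[\""
                + toBase64(imageBytes) + "\"]}";
    }
}
```

When using Spring AI, this encoding is not something you write yourself; attaching a Media object to the message is enough.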
Below is a straightforward code example excerpted from OllamaChatModelMultimodalIT.java, illustrating the fusion of user text with an image.
// Load the test image from the classpath.
var imageResource = new ClassPathResource("/multimodal.test.png");

// Attach the image to the user message as PNG media.
var userMessage = new UserMessage("Explain what do you see on this picture?",
        new Media(MimeTypeUtils.IMAGE_PNG, imageResource));

// Select the LLaVA model for this request only.
ChatResponse response = chatModel.call(new Prompt(userMessage,
        OllamaOptions.builder().withModel(OllamaModel.LLAVA).build()));
The example shows a model taking as input the multimodal.test.png image along with the text message "Explain what do you see on this picture?", and generating a response like this:
The image shows a small metal basket filled with ripe bananas and red apples. The basket is placed on a surface, which appears to be a table or countertop, as there's a hint of what seems like a kitchen cabinet or drawer in the background. There's also a gold-colored ring visible behind the basket, which could indicate that this photo was taken in an area with metallic decorations or fixtures. The overall setting suggests a home environment where fruits are being displayed, possibly for convenience or aesthetic purposes.
OpenAI API Compatibility
Ollama is OpenAI API-compatible and you can use the Spring AI OpenAI client to talk to Ollama and use tools.
For this, you need to configure the OpenAI base URL to your Ollama instance: spring.ai.openai.chat.base-url=http://localhost:11434 and select one of the provided Ollama models: spring.ai.openai.chat.options.model=mistral.
Check the OllamaWithOpenAiChatModelIT.java tests for examples of using Ollama over Spring AI OpenAI.
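To make the compatibility concrete, the sketch below builds the kind of OpenAI-style request that can be sent to an Ollama server, using only the JDK. It assumes the OpenAI-style path /v1/chat/completions under the base URL configured above; the class and method names are hypothetical, and the JSON body is concatenated by hand on the assumption that the inputs need no escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: an OpenAI-style chat request aimed at an Ollama server.
final class OpenAiCompatSketch {

    // Builds a POST request carrying a single user message for the given model.
    static HttpRequest chatCompletionRequest(String baseUrl, String model, String userText) {
        // Assumes model and userText contain no characters that require JSON escaping.
        String body = "{\"model\":\"" + model + "\",\"messages\":[{\"role\":\"user\",\"content\":\""
                + userText + "\"}]}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The Spring AI OpenAI client takes care of this request construction for you once the base URL and model properties are set.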
Sample Controller
Create a new Spring Boot project and add the spring-ai-ollama-spring-boot-starter to your pom (or gradle) dependencies.
Add an application.yaml file, under the src/main/resources directory, to enable and configure the Ollama chat model:
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: mistral
          temperature: 0.7
Replace the base-url with your Ollama server URL.
This will create an OllamaChatModel implementation that you can inject into your classes.
Here is an example of a simple @RestController class that uses the chat model for text generation.
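import java.util.Map;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {
    private final OllamaChatModel chatModel;
    @Autowired
    public ChatController(OllamaChatModel chatModel) {
        this.chatModel = chatModel;
    }

    // Returns the complete generation in a single response.
    @GetMapping("/ai/generate")
    public Map<String,String> generate(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
        return Map.of("generation", this.chatModel.call(message));
    }

    // Streams partial responses as they are produced.
    @GetMapping("/ai/generateStream")
    public Flux<ChatResponse> generateStream(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
        Prompt prompt = new Prompt(new UserMessage(message));
        return this.chatModel.stream(prompt);
    }
}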
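Once the controller is running, the /ai/generate endpoint can be called from any HTTP client. The sketch below builds the request URI with a properly encoded message parameter, using only the JDK. The class name is hypothetical, and http://localhost:8080 is an assumption based on Spring Boot's default port; adjust it to your application.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative client-side helper for the sample controller; not part of Spring AI.
final class GenerateEndpointClientSketch {

    // Builds the URI for the /ai/generate endpoint, URL-encoding the message.
    static URI generateUri(String appBaseUrl, String message) {
        String encoded = URLEncoder.encode(message, StandardCharsets.UTF_8);
        return URI.create(appBaseUrl + "/ai/generate?message=" + encoded);
    }
}
```

The resulting URI can be passed to java.net.http.HttpClient or pasted into a browser; the response is a JSON object with a single generation field, as returned by the controller.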
Manual Configuration
If you don’t want to use the Spring Boot auto-configuration, you can manually configure the OllamaChatModel in your application.
The OllamaChatModel implements the ChatModel and StreamingChatModel interfaces and uses the Low-level OllamaApi Client to connect to the Ollama service.
To use it, add the spring-ai-ollama dependency to your project’s Maven pom.xml or Gradle build.gradle build file:
Maven:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama</artifactId>
</dependency>
Gradle:
dependencies {
    implementation 'org.springframework.ai:spring-ai-ollama'
}
Refer to the Dependency Management section to add the Spring AI BOM to your build file.
The spring-ai-ollama dependency also provides access to the OllamaEmbeddingModel.
For more information about the OllamaEmbeddingModel, refer to the Ollama Embedding Model section.
Next, create an OllamaChatModel instance and use it to send requests for text generation:
var ollamaApi = new OllamaApi();

var chatModel = new OllamaChatModel(ollamaApi,
        OllamaOptions.create()
            .withModel(OllamaOptions.DEFAULT_MODEL)
            .withTemperature(0.9));

ChatResponse response = chatModel.call(
        new Prompt("Generate the names of 5 famous pirates."));

// Or with streaming responses
Flux<ChatResponse> streamingResponse = chatModel.stream(
        new Prompt("Generate the names of 5 famous pirates."));
The OllamaOptions provides the configuration information for all chat requests.
Low-level OllamaApi Client
The OllamaApi provides a lightweight Java client for the Ollama Chat Completion API.
The following class diagram illustrates the OllamaApi chat interfaces and building blocks:
The OllamaApi is a low-level API and is not recommended for direct use. Use the OllamaChatModel instead.
Here is a simple snippet showing how to use the API programmatically:
OllamaApi ollamaApi = new OllamaApi("YOUR_HOST:YOUR_PORT");

// Sync request
var request = ChatRequest.builder("orca-mini")
    .withStream(false) // not streaming
    .withMessages(List.of(
            Message.builder(Role.SYSTEM)
                .withContent("You are a geography teacher. You are talking to a student.")
                .build(),
            Message.builder(Role.USER)
                .withContent("What is the capital of Bulgaria and what is the size? "
                        + "What is the national anthem?")
                .build()))
    .withOptions(OllamaOptions.create().withTemperature(0.9))
    .build();

ChatResponse response = ollamaApi.chat(request);

// Streaming request
var request2 = ChatRequest.builder("orca-mini")
    .withStream(true) // streaming
    .withMessages(List.of(Message.builder(Role.USER)
        .withContent("What is the capital of Bulgaria and what is the size? " + "What is the national anthem?")
        .build()))
    .withOptions(OllamaOptions.create().withTemperature(0.9).toMap())
    .build();

Flux<ChatResponse> streamingResponse = ollamaApi.streamingChat(request2);
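A streaming request delivers the answer as a sequence of partial chunks, with the last one flagged as done. The sketch below shows, in plain Java, how such fragments can be stitched back into the full text. The Chunk class is a hypothetical, simplified stand-in for a streamed response and is not the OllamaApi.ChatResponse type; with a Flux you would typically collect or reduce the stream instead of iterating over a list.

```java
import java.util.List;

// Conceptual sketch of assembling streamed fragments; not part of Spring AI.
final class StreamAssemblySketch {

    // Hypothetical, simplified view of one streamed chunk: a text fragment plus a completion flag.
    static final class Chunk {
        final String content;
        final boolean done;

        Chunk(String content, boolean done) {
            this.content = content;
            this.done = done;
        }
    }

    // Concatenates fragments in arrival order and stops after the chunk flagged as done.
    static String assemble(List<Chunk> chunks) {
        StringBuilder text = new StringBuilder();
        for (Chunk chunk : chunks) {
            text.append(chunk.content);
            if (chunk.done) {
                break;
            }
        }
        return text.toString();
    }
}
```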