Apache Cassandra

This section walks you through setting up CassandraVectorStore to store document embeddings and perform similarity searches.

What is Apache Cassandra?

Apache Cassandra® is a true open source distributed database renowned for linear scalability, proven fault-tolerance, and low latency, making it the perfect platform for mission-critical transactional data.

Its Vector Similarity Search (VSS) is based on the JVector library that ensures best-in-class performance and relevancy.

A vector search in Apache Cassandra is done as simply as:

SELECT content FROM table ORDER BY content_vector ANN OF query_embedding LIMIT 1;

More documentation on CQL vector search is available in the Apache Cassandra documentation.
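For orientation, a minimal sketch of the kind of schema and query involved might look like the following (table name, column names, and vector dimension are illustrative only, not what the store itself creates):

-- a table with a vector column (384 dimensions matches all-MiniLM-L6-v2)
CREATE TABLE vector_demo (
    id text PRIMARY KEY,
    content text,
    content_vector vector<float, 384>
);

-- a Storage-Attached Index (SAI) enables ANN ordering on the vector column
CREATE CUSTOM INDEX vector_demo_ann ON vector_demo (content_vector)
    USING 'StorageAttachedIndex';

-- ANN queries require a LIMIT; bind the query embedding at the marker
SELECT content FROM vector_demo
    ORDER BY content_vector ANN OF ? LIMIT 3;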

This Spring AI Vector Store is designed to work both for brand-new RAG applications and for retrofitting on top of existing data and tables.

The store can also be used for non-RAG use cases on an existing database, e.g. semantic search, geo-proximity search, etc.

The vector store implementation can initialize the requisite schema for you, but you must opt in by specifying the initializeSchema boolean in the appropriate constructor or by setting …initialize-schema=true in the application.properties file.

This is a breaking change! In earlier versions of Spring AI, this schema initialization happened by default.

The store will automatically create, or enhance, the schema as needed according to its configuration. If you don’t want the schema to be modified, configure the store with disallowSchemaChanges.
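For example, to point the store at an existing table and forbid any DDL, a configuration along these lines (a minimal sketch, using the builder methods shown later on this page) can be used:

CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder()
    .withKeyspaceName("springframework")   // defaults shown in the Usage section below
    .withTableName("ai_vector_store")
    .disallowSchemaChanges()               // never create or alter the schema
    .build();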

What is JVector?

JVector is a pure Java embedded vector search engine.

It stands out from other HNSW Vector Similarity Search implementations by being:

  • Algorithmic-fast. JVector uses state-of-the-art graph algorithms inspired by DiskANN and related research that offer high recall and low latency.

  • Implementation-fast. JVector uses the Panama SIMD API to accelerate index build and queries.

  • Memory efficient. JVector compresses vectors using product quantization so they can stay in memory during searches. (As part of its PQ implementation, JVector’s SIMD-accelerated k-means class is 5x faster than the one in Apache Commons Math.)

  • Disk-aware. JVector’s disk layout is designed to do the minimum necessary iops at query time.

  • Concurrent. Index builds scale linearly to at least 32 threads. Double the threads, halve the build time.

  • Incremental. Query your index as you build it. No delay between adding a vector and being able to find it in search results.

  • Easy to embed. API designed for easy embedding, by people using it in production.

Prerequisites

  1. An EmbeddingModel instance to compute the document embeddings. This is usually configured as a Spring Bean (see the bean sketch after this list). Several options are available:

    • Transformers Embedding - computes the embedding in your local environment. The default is via ONNX and the all-MiniLM-L6-v2 Sentence Transformers. This just works.

    • OpenAI Embeddings - uses the OpenAI embedding endpoint. You need to create an account at OpenAI Signup and generate an API key at API Keys.

    • There are many more choices, see Embeddings API docs.

  2. An Apache Cassandra instance, version 5.0-beta1 or later

    1. DIY Quick Start

    2. For a managed offering, Astra DB provides a healthy free tier.
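For example, the local ONNX default mentioned above can be registered as a Spring bean like this (a minimal sketch; the same bean appears again in the advanced example at the end of this page):

@Bean
public EmbeddingModel embeddingModel() {
    // local ONNX all-MiniLM-L6-v2 model -- runs in-process, no API key required
    return new TransformersEmbeddingModel();
}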

Dependencies

Add these dependencies to your project:

  • For just the Cassandra Vector Store

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-cassandra-store</artifactId>
</dependency>

  • Or, for everything you need in a RAG application (using the default ONNX Embedding Model)

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-cassandra-store-spring-boot-starter</artifactId>
</dependency>

Refer to the Dependency Management section to add the Spring AI BOM to your build file.
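For reference, a typical Maven import of the BOM looks roughly like this (the version property is a placeholder you define in your build):

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.ai</groupId>
      <artifactId>spring-ai-bom</artifactId>
      <version>${spring-ai.version}</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>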

Usage

Create a CassandraVectorStore instance connected to your Apache Cassandra database:

@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {

  CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder().build();

  return new CassandraVectorStore(config, embeddingModel);
}

The default configuration connects to Cassandra at localhost:9042 and will automatically create a default schema in keyspace springframework, table ai_vector_store.

The Cassandra Java Driver is most easily configured via an application.conf file on the classpath. See the driver’s configuration reference for more information.
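For example, a minimal application.conf pointing the driver at a local node might look like this (the contact point and datacenter name are illustrative):

datastax-java-driver {
  basic.contact-points = [ "127.0.0.1:9042" ]
  basic.load-balancing-policy.local-datacenter = "datacenter1"
}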

Then in your main code, create some documents:

List<Document> documents = List.of(
   new Document("Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!! Spring AI rocks!!", Map.of("country", "UK", "year", 2020)),
   new Document("The World is Big and Salvation Lurks Around the Corner", Map.of()),
   new Document("You walk forward facing the past and you turn back toward the future.", Map.of("country", "NL", "year", 2023)));

Now add the documents to your vector store:

vectorStore.add(documents);

And finally, retrieve documents similar to a query:

List<Document> results = vectorStore.similaritySearch(
   SearchRequest.query("Spring").withTopK(5));

If all goes well, you should retrieve the document containing the text "Spring AI rocks!!".
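To inspect what came back, you can, for example, print each document’s content and metadata (a small sketch assuming the usual Document accessors):

results.forEach(doc ->
   System.out.println(doc.getContent() + " " + doc.getMetadata()));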

You can also limit results based on a similarity threshold:

List<Document> results = vectorStore.similaritySearch(
   SearchRequest.query("Spring").withTopK(5)
      .withSimilarityThreshold(0.5d));

Metadata filtering

You can leverage the generic, portable metadata filters with the CassandraVectorStore as well. Metadata columns must be configured in CassandraVectorStoreConfig.

For example, you can use either the text expression language:

vectorStore.similaritySearch(
   SearchRequest.query("The World").withTopK(TOP_K)
      .withFilterExpression("country in ['UK', 'NL'] && year >= 2020"));

or programmatically using the expression DSL:

FilterExpressionBuilder b = new FilterExpressionBuilder();

Filter.Expression f = b.and(b.in("country", "UK", "NL"), b.gte("year", 2020)).build();

vectorStore.similaritySearch(
   SearchRequest.query("The World").withTopK(TOP_K)
      .withFilterExpression(f));

The portable filter expressions get automatically converted into CQL queries.
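Conceptually, the filter above simply becomes an extra predicate on the ANN query; the generated CQL is roughly of this shape (illustrative only, table and column names depend on your configuration):

SELECT content FROM ai_vector_store
   WHERE country IN ('UK', 'NL') AND year >= 2020
   ORDER BY embedding ANN OF ? LIMIT 5;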

For metadata columns to be searchable, they must be either primary keys or SAI indexed. To make non-primary-key columns indexed, configure the metadata column with SchemaColumnTags.INDEXED.
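For example, to make the country and year metadata from the documents above filterable, the columns could be declared with that tag (a sketch; the tag-bearing SchemaColumn constructor is assumed here):

CassandraVectorStoreConfig config = CassandraVectorStoreConfig.builder()
    .addMetadataColumns(List.of(
        // columns matching the example documents above;
        // INDEXED asks the store to create a SAI index on each column
        new SchemaColumn("country", DataTypes.TEXT, SchemaColumnTags.INDEXED),
        new SchemaColumn("year", DataTypes.INT, SchemaColumnTags.INDEXED)))
    .build();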

Advanced Example: Vector Store on top of the full Wikipedia dataset

The following example demonstrates how to use the store on an existing schema. Here we use the schema from the github.com/datastax-labs/colbert-wikipedia-data project, which comes with the full Wikipedia dataset already vectorized for you.

Usage

Create the schema in the Cassandra database first:

wget https://s.apache.org/colbert-wikipedia-schema-cql -O colbert-wikipedia-schema.cql

cqlsh -f colbert-wikipedia-schema.cql

Then configure the store like:

@Bean
public CassandraVectorStore store(EmbeddingModel embeddingModel) {

    List<SchemaColumn> partitionColumns = List.of(new SchemaColumn("wiki", DataTypes.TEXT),
            new SchemaColumn("language", DataTypes.TEXT), new SchemaColumn("title", DataTypes.TEXT));

    List<SchemaColumn> clusteringColumns = List.of(new SchemaColumn("chunk_no", DataTypes.INT),
            new SchemaColumn("bert_embedding_no", DataTypes.INT));

    List<SchemaColumn> extraColumns = List.of(new SchemaColumn("revision", DataTypes.INT),
            new SchemaColumn("id", DataTypes.INT));

    CassandraVectorStoreConfig conf = CassandraVectorStoreConfig.builder()
        .withKeyspaceName("wikidata")
        .withTableName("articles")
        .withPartitionKeys(partitionColumns)
        .withClusteringKeys(clusteringColumns)
        .withContentColumnName("body")
        .withEmbeddingColumndName("all_minilm_l6_v2_embedding")
        .withIndexName("all_minilm_l6_v2_ann")
        .disallowSchemaChanges()
        .addMetadataColumns(extraColumns)
        .withPrimaryKeyTranslator((List<Object> primaryKeys) -> {
            // the delimiter used to join fields together into the document's id is arbitrary
            // here "§¶" is used
            if (primaryKeys.isEmpty()) {
                return "test§¶0";
            }
            return format("%s§¶%s", primaryKeys.get(2), primaryKeys.get(3));
        })
        .withDocumentIdTranslator((id) -> {
            String[] parts = id.split("§¶");
            String title = parts[0];
            int chunk_no = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
            return List.of("simplewiki", "en", title, chunk_no, 0);
        })
        .build();

    return new CassandraVectorStore(conf, embeddingModel);
}

@Bean
public EmbeddingModel embeddingModel() {
    // default is ONNX all-MiniLM-L6-v2 which is what we want
    return new TransformersEmbeddingModel();
}

Complete Wikipedia dataset

If you would like to load the full Wikipedia dataset, first download simplewiki-sstable.tar from s.apache.org/simplewiki-sstable-tar. This will take a while; the file is tens of GBs.

tar -xf simplewiki-sstable.tar -C ${CASSANDRA_DATA}/data/wikidata/articles-*/

nodetool import wikidata articles ${CASSANDRA_DATA}/data/wikidata/articles-*/

If you have existing data in this table, check that the tarball’s files don’t clobber existing SSTables when extracting.

An alternative to nodetool import is to just restart Cassandra. If there are any failures in the indexes, they will be rebuilt automatically.