Skip to content

Same hybrid-search query, different places.

While developing and investigating use cases for CrateDB I often find myself doing hybrid-search in different places, sometimes it's a python script, a fastapi backend or some integration/tookit. The query is pretty much the same or very similar.

I'm constantly copy-pasting the query and adding new tests, I'm repeating myself all the time, with QueryZen I can develop once, test once and deploy once.

If there was some change in how we do hybrid-search in CrateDB, I would need to change it everywhere, with QueryZen I can just create a new Zen, and change the version in the applications.

Parametrizing the hybrid-search query.

The query can originally be found in: CrateDB Blog: Hybrid Search Explained.

WITH bm25 as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", 'knn search')
  ORDER BY
    _score DESC
),
vector as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(
      xs,
      [vector...],
      15
    )
)
SELECT
  TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank,
  bm25.rank as bm25_rank,
  vector.rank as vector_rank,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY final_rank DESC

The first things that we can parametrize are the vector, and the match for bm25, 'knn search', we could and should parametrize more things like: the table names, the column names and the RRF k parameter, which is hardcoded as 60. For now we will only do the match terms.

from queryzen import QueryZen

qz = QueryZen()

query = """
WITH bm25 as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", :match_term)
  ORDER BY
    _score DESC
),
vector as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(
      xs,
      :vector,
      15
    )
)
SELECT
  TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank,
  bm25.rank as bm25_rank,
  vector.rank as vector_rank,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY final_rank DESC
"""

qz.create(collection='crate', name='hybrid-search', query=query)

Now in any Python program I can reuse the query!

hybrid_search.py

from queryzen import QueryZen, exceptions
from .vectors import make_vector

qz = QueryZen()

def hybrid_search(search_phrase: str):
    """Searches a search_phrase in the Database, using hybrid search (BM25 + vector search)"""
    try:
        zen = qz.get(collection='crate', name='hybrid-search', version=1)
    except exceptions.ZenDoesNotExist:
        # handle zen not existing.
        pass

    result = qz.run(zen,
                    database='crate',
                    match_term=search_phrase,
                    vector=make_vector(search_phrase))    
    return result.rows

Now in my collection crate, there is a zen named hybrid_search with version 1, if I were to update the query, I could create a new one, with version 2, and test it. In the end application I would only need to change the version=1 to version=2.

Furthermore, I could just ship in a small package the hybrid_search function, and ship in my entire company's Python ecosystem the same way of doing hybrid search, centralized.