Dropbox's vector search: Challenges and model choices...

Search is a hard problem. Keyword search leaves a lot of queries unmet. But organisations from the Financial Times to Dropbox are developing AI-powered semantic search capabilities that “understand” the complex relationship between user queries and diverse document content.

That way, even if a keyword or keyphrase has no direct match, the engine can surface closely related, highly relevant results – e.g. a search for “employment contract” can find “offer letter” documents.

For a company like Dropbox – home to over a trillion documents and exabytes of data – using an AI model to run the vector searches behind this more advanced search could get prohibitively expensive computationally, so when it set out to build this feature, it chose carefully.

The storage firm has committed to rolling out semantic search/vector search in “early 2025” for all business users. It tested the feature with internal users in early 2024, released it externally to a subset of Pro and Essential users in May 2024, then took it GA for all Pro and Essential users in August 2024. (It cut empty search sessions by 17% and lifted search success by 2%.)

Dropbox vector search

Explaining its progress in a technical blog, Dropbox’s team said that it had adopted a modified version of the open-source multilingual-e5-large text embedding model, developed by Microsoft (on the XLM-RoBERTa-large architecture) and released in 2023, to power semantic search, which it is taking universally GA in early 2025.

Vector search turns content into numbers or “embeddings” that can be tailored, using machine learning, for different content and use cases. It then does the same thing for user queries, to “retrieve results that align with the query's intent, rather than literal terms” as Dropbox puts it.
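To make that concrete, here is a minimal sketch of the retrieval step in Python. It is an illustration only, not Dropbox's implementation: the three-dimensional vectors and filenames are made up, whereas a real system would get far larger vectors from an embedding model. Documents are ranked by cosine similarity to the query vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, doc_embeddings, top_k=2):
    """Return the names of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, vector), name)
        for name, vector in doc_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical, hand-written embeddings; a real model would produce these.
DOC_EMBEDDINGS = {
    "offer_letter.docx": [0.9, 0.1, 0.2],
    "holiday_photos.txt": [0.1, 0.9, 0.1],
    "q3_budget.txt": [0.2, 0.1, 0.9],
}

# Pretend this is the embedding of the query "employment contract".
QUERY = [0.8, 0.2, 0.3]

results = search(QUERY, DOC_EMBEDDINGS)
print(results)
```

The query shares no keywords with any filename, yet the document whose vector points in nearly the same direction ranks first.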

Dropbox tested models against the Massive Text Embedding Benchmark (MTEB), an open-source benchmark that assesses document embedding models across eight evaluation tasks and 56 datasets.

Among the tweaks it made while evaluating models were the following:

Customizations: “We added adapters to enable the evaluation of models running in our in-house inference services, in addition to models executed inline. We also added adapters to allow streaming datasets that reside in our infrastructure…”
Chunking: “MTEB by default presumes that a single embedding is generated per document [but] our user’s document can range from very tiny to very large… We implemented various nuanced strategies for… chunking, a given document into individual chunks.”
Storage: “To optimize storage costs, we reduced the precision (full 32-bit floating point, half float, quarter float, fixed point of various bit depth) and dimensionality… of MTEB's full-sized embeddings.”
Files: “Documents in Dropbox are named by our users—and filenames are quite significant in retrieval tasks. We crafted various approaches to incorporate embeddings of the filenames into MTEB when applicable…”
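
Two of the ideas above, chunking and reduced-precision storage, can be sketched in a few lines of Python. Dropbox does not detail its strategies in the passages quoted here, so everything below is an assumption for illustration: the function names are hypothetical, the chunker splits on whitespace with a fixed overlap, and the quantizer uses simple symmetric 8-bit scaling.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into overlapping chunks of up to chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the document
    return chunks


def quantize_int8(vector):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # 1.0 if all zeros
    return [round(x / scale) for x in vector], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]


chunks = chunk_words("a b c d e f g h i j k l")
print(chunks)

quantized, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize(quantized, scale)
print(quantized, restored)
```

Each chunk would be embedded separately, so a long document yields several vectors. Storing each value as one 8-bit integer instead of a 32-bit float cuts storage roughly fourfold, at the cost of a small rounding error per value.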

Dropbox rolls a lot of its own infrastructure after famously moving off the cloud in 2015-2017 and as such its approach will be closely watched.

The company said on December 11: “We plan to introduce the concept of semantic search for multiple file types, but in this first iteration of semantic search at Dropbox, we focused on supporting only text files.”

The news comes after Dropbox executives said on a November earnings call (as it reported $639 million in quarterly revenue) that it was “at an inflection point as a company. Our core FSS [file sync and share] business has matured and we've been investing in new products to solve new problems and drive growth…given the challenging environment… we needed to better align our investments with the opportunities ahead.”

It laid off 20% of its staff this year and admitted the year’s results “benefited from a $30 million tailwind through the extension of the useful life of our data center hardware” but continues to support some 500,000 business accounts and the better part of 20 million subscribers on Dropbox.

