Trying to keep up with the pace of innovation in the generative AI world is a tough ask for all but the most laser-focused: new models (large and small) proliferate; records are broken; benchmarks challenged; emerging contenders claim unique performance at low cost on commodity hardware; hyperscalers stake their own claims to optimal performance…
This progress is exciting and affords great opportunity, but to Lin Qiao, founder and CEO of startup Fireworks AI, hitching your wagon to a Single Big Winner of the Model Wars™ is not the future – and for enterprise users, trying to stay on top of it all is a job best left to technology partners. Instead, she says: “The next wave of quality is not going to be one of ‘single model solves all problems.’ There'll be hundreds of small expert models solving narrower sets of problems.”
Fireworks’ view, in short, is that AI application developers will use platforms that disaggregate tasks into sub-tasks, each allocated to the best model for the job and optimised for quality, cost, and latency – an approach dubbed “compound AI”.
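In code terms, the heart of the idea is a router that sends each sub-task to the model best suited to it. A minimal sketch, with the task taxonomy and model names purely illustrative:

```python
# Sketch of the "compound AI" routing idea: each sub-task goes to the
# model best suited to it. Task labels and model names are illustrative.
ROUTES = {
    "summarise": "small-summary-model",    # cheap, low latency
    "extract":   "structured-data-model",  # narrow specialist
    "reason":    "large-reasoning-model",  # costly, highest quality
}

def route(task_type: str, default: str = "generalist-model") -> str:
    """Pick the expert model for a sub-task, falling back to a generalist."""
    return ROUTES.get(task_type, default)

plan = [("summarise", "condense the report"), ("reason", "draft a recommendation")]
for task_type, prompt in plan:
    print(f"{prompt!r} -> {route(task_type)}")
```

In practice the platform, not the developer, makes the routing decision – which is precisely the complexity Fireworks says it abstracts away.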
The approach may sound avant-garde to those industry CTOs still tinkering around the edges of what OpenAI or equivalent models alone can do, but Lin, her team, and Fireworks’ investors are betting on it.
Making compound AI easy – and efficient
Getting this right sounds complex. Fireworks aims to abstract that complexity away – delivering customers fast, affordable, and highly performant generative AI, with each task farmed out to the best large or small model for the job across multiple modalities, and with an eye on the industry’s three main problems: cost, quality, and latency.
Fireworks’ customers can use its platform to tap hundreds of generative AI models across multiple modalities (text, video, you name it). It does the heavy lifting of testing, onboarding, and optimising models, along with the hardware and software tuning needed to run them smoothly. Customers can opt for generative AI software-as-a-service or – an approach Lin says around 10% of customers take – containerised deployments in a private cloud.
Three main infrastructure options
Fireworks has developed an easy-to-use API to access an extensive range of models, and users can pick from three infrastructure options to run them (a request sketch follows the list):
1: Serverless – pick a model on pre-configured GPUs, pay per token, and avoid cold boots;
2: On-demand – use private GPUs running Fireworks software, which the firm claims offers ~250% improved throughput and 50% improved latency compared to vLLM;
3: Enterprise – hardware and software “personally tailored” by the Fireworks team for a customer’s use case, with bring-your-own-cloud (BYOC) deployment options, SLAs, and more.
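As an illustration of the first, serverless option, a single pay-per-token request is one API call. A minimal sketch, assuming Fireworks’ OpenAI-compatible endpoint; the model name is illustrative:

```python
# Sketch: one serverless, pay-per-token request to a Fireworks-hosted model.
# Assumes Fireworks' OpenAI-compatible inference endpoint; the model name
# below is illustrative, not a recommendation.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # assumed name
    messages=[{"role": "user", "content": "Summarise compound AI in one line."}],
)
print(resp.choices[0].message.content)
```

Because the GPUs are pre-configured and shared, there is no instance to provision: billing is per token.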
In 2024, Fireworks also trained and released its own compound AI model, f1, which specialises in complex reasoning and interweaves multiple open models at the inference layer; early tests showed impressive results.
Again, this may not surprise its backers. Lin, for example, previously led the PyTorch team at Meta that, as investor Sequoia put it, “rebuilt the whole stack to meet the complex needs of the world's largest B2C company”, and Fireworks’ team includes many other Meta AI veterans.
Their experience means Fireworks operates comfortably across the whole stack: from inference serving, through PyTorch runtime and low-level kernel optimisation, to device, CPU, and memory-bandwidth tuning.
MongoDB: An “end-to-end” AI partnership
One early investor was MongoDB, which has formed a deep partnership with Fireworks. Along with AWS, Fireworks is a partner in MongoDB’s increasingly popular AI Applications Program (MAAP), and the three are collaborating closely on a one-stop shop for enterprises looking to build secure, scalable generative AI applications.
Customers need to unify operational data, unstructured data, and vector embeddings to build differentiated, secure AI applications – an area where MongoDB excels. Fireworks AI, meanwhile, offers a fine-tuning service whose CLI ingests JSON-formatted objects from the likes of MongoDB Atlas. For enterprise customers, it’s a natural combination.
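The handoff between the two is easy to picture. A minimal sketch, assuming a pymongo connection and a hypothetical chat-style JSONL training format; the collection and field names are invented for illustration:

```python
# Sketch: export MongoDB Atlas documents as JSONL training data for
# fine-tuning. The "support_chats" collection and its "question"/"answer"
# fields are hypothetical, as is the chat-style record format.
import json
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
chats = client["appdb"]["support_chats"]

with open("train.jsonl", "w") as f:
    for doc in chats.find({}, {"question": 1, "answer": 1}):
        record = {"messages": [
            {"role": "user", "content": doc["question"]},
            {"role": "assistant", "content": doc["answer"]},
        ]}
        f.write(json.dumps(record) + "\n")
```

The resulting JSONL file is what a fine-tuning CLI of the kind Fireworks describes would then ingest.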
As Lin puts it: “Stepping back, AI models are not deterministic; models are probabilistic. The downside of that is that they can hallucinate.
“Sometimes that’s a feature for creative writing, maybe, but a lot of time it's a bug! The best way to reduce that is to give it a lot of context, for example through Retrieval Augmented Generation. Fireworks is the industry-leading GenAI inference platform; MongoDB is the industry-leading vector search engine for RAG. It’s a very natural match for developers to get access to the best end-to-end platform.”
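The pairing Lin describes maps onto a short retrieval-then-generate loop. A minimal sketch, assuming MongoDB Atlas’s $vectorSearch aggregation stage and Fireworks’ OpenAI-compatible endpoint; the index, field, and model names are illustrative:

```python
# Sketch of the RAG pairing described above: MongoDB Atlas retrieves
# relevant context via vector search; a Fireworks-hosted model answers.
# Index, collection, field, and model names are illustrative assumptions.
from pymongo import MongoClient
from openai import OpenAI  # used against Fireworks' OpenAI-compatible API

mongo = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
docs = mongo["appdb"]["documents"]

fw = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

def answer(question: str, query_vector: list[float]) -> str:
    # Atlas $vectorSearch finds the chunks nearest the query embedding.
    hits = docs.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",   # assumed index name
            "path": "embedding",       # assumed embedding field
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    }])
    context = "\n".join(hit["text"] for hit in hits)  # assumed text field
    resp = fw.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # assumed
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

Grounding the model in retrieved context is exactly the hallucination-reducing pattern Lin points to.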
For enterprise customers with ultra-low-latency use cases, meanwhile, Fireworks runs on NVIDIA H100 and A100 Tensor Core GPUs through Amazon EC2 P5 and P4 instances. The scale and sheer depth of capability of AWS infrastructure make it an important partner.
Like many SaaS companies, Fireworks needs to straddle both partnership and competition. With AWS offering its own plurality of managed AI services, does it also pose a threat? The Stack puts the question to Lin. “I think if this risk always came true, then there would be no startups in the B2B business,” she responds promptly. “I think because we are running very lean, we're very agile, we're leaning into being the cutting edge, [at the] forefront of new development. I think we see things deeper and faster than hyperscalers, and that's where we are confident we can outrun them by providing the best developer platform for generative AI.”