The models powering OpenAI’s AI chatbots are a black box. But impressive free and open source ChatGPT alternatives that let CTOs and their data teams tune models for bespoke use cases are landing fast – and even if many carry research-only licences, they point to a clear path towards self-built alternatives.
This week Colossal-AI, a startup led by National University of Singapore Professor Yang You, joined them, open-sourcing a large language model (LLM) and chatbot dubbed “ColossalChat” under a permissive Apache 2.0 licence. Professor You said the model requires just 10 billion parameters* to attain bilingual proficiency in English and Chinese, with results comparable to GPT-3.5. The project already has over 23,000 stars and 2,600 forks on GitHub.
He described Colossal-AI as the “first to open-source a complete RLHF [reinforcement learning with human feedback] pipeline that includes supervised data collection, supervised fine-tuning, reward model training, and reinforcement learning fine-tuning, based on the LLaMA pre-trained model” and ColossalChat as “the most practical open-source project that closely resembles the original ChatGPT technical solution!”
Its training also tapped 100,000 Q&A pairs in both English and Chinese that were “collected and cleaned from real-life question scenarios on social media platforms, serving as the seed dataset”, then “expanded using self-instruct technology”, with annotation costs of approximately $900. Professor You added that inference requires just 4GB of GPU memory, with training needing only a “small amount of computing power on a single server.”
"Many reasons for" open source ChatGPT alternatives
This week Databricks also open sourced Dolly, an LLM that it said “exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT,” adding: “We’re in the earliest days of the democratization of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.”
(The underlying Alpaca dataset is licensed under a Creative Commons NonCommercial licence.)
Databricks, which provides a set of open source tools for building, deploying, sharing, and maintaining enterprise-grade data solutions, said: “There are many reasons a company would prefer to build their own model rather than sending data to a centralized LLM provider that serves a proprietary model behind an API.
“For many companies, the problems and datasets most likely to benefit from AI represent their most sensitive and proprietary intellectual property, and handing it over to a third party may be unpalatable. Furthermore, organizations may have different tradeoffs in terms of model quality, cost, and desired behavior. We believe that most ML users are best served long term by directly owning their models,” the company added on March 24.
Dolly was trained on the Databricks Machine Learning Platform using a two-year-old open source model (GPT-J) “subjected to just 30 minutes of fine tuning on a focused corpus of 50k records (Stanford Alpaca).”
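For teams wondering what that kind of recipe looks like in practice, the sketch below fine-tunes GPT-J on the Stanford Alpaca instruction records using the Hugging Face transformers and datasets libraries. The model and dataset identifiers, prompt format and hyperparameters are illustrative assumptions for a single-GPU run, not Databricks’ actual Dolly training code.

```python
# Illustrative sketch only: instruction fine-tuning an existing causal LM
# (GPT-J) on the ~52k Stanford Alpaca records, in the spirit of Dolly.
# Identifiers and hyperparameters are assumptions, not Databricks' recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "EleutherAI/gpt-j-6B"         # the two-year-old open source base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Public mirror of the Stanford Alpaca instruction/response records
records = load_dataset("tatsu-lab/alpaca", split="train")

def tokenize(example):
    # Concatenate instruction, optional input and response into one training text
    prompt = f"Instruction: {example['instruction']}\n"
    if example["input"]:
        prompt += f"Input: {example['input']}\n"
    prompt += f"Response: {example['output']}"
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = records.map(tokenize, remove_columns=records.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-style-finetune",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           fp16=True),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modelling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```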
Small is beautiful?
Colossal-AI’s release was helped on its way by LLaMA, a set of large language models (LLMs) ranging from 7 billion to 65 billion parameters released by Meta in February under a non-commercial research licence. (Meta noted at the time: “Unlike [DeepMind’s] Chinchilla, [Google’s] PaLM, or [OpenAI’s] GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. ‘Books – 2TB’ or ‘Social media conversations’).”)
ColossalChat, released on March 29, includes a complete “reinforcement learning with human feedback” (RLHF) pipeline that Professor You said covers supervised data collection, supervised fine-tuning, reward model training, and reinforcement learning fine-tuning. He described it in a Medium post as “the first practical open-source project that includes a complete RLHF process for replicating ChatGPT-like models.”
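To make one stage of that pipeline concrete, the snippet below shows the pairwise ranking loss typically used for the reward model training step: given a human preference between two responses to the same prompt, the reward model is pushed to score the preferred response higher. It is a generic PyTorch illustration of that standard technique, not Colossal-AI’s own code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for RLHF reward model training.

    Each pair comes from a human annotator preferring one model response
    ("chosen") over another ("rejected") for the same prompt; the loss
    falls as the reward model widens the score margin between them.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of three preference pairs with scalar reward scores.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(chosen, rejected))  # the trained reward then guides RL fine-tuning
```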
See also – Opinion: The Big Hallucination
Bloomberg meanwhile this week detailed its own creation of the BloombergGPT LLM, a model built on a 363 billion token dataset of financial documents that was augmented with a 345 billion token public dataset to create a training corpus of over 700 billion tokens: “Using a portion of this training corpus, the team trained a 50-billion parameter decoder-only causal language model. The resulting model was validated on existing finance-specific NLP benchmarks, a suite of Bloomberg internal benchmarks, and broad categories of general-purpose NLP tasks from popular benchmarks (e.g., BIG-bench Hard, Knowledge Assessments, Reading Comprehension, and Linguistic Tasks). Notably, the BloombergGPT model outperforms existing open models of a similar size on financial tasks by large margins, while still performing on par or better on general NLP benchmarks,” Bloomberg said.
“For all the reasons generative LLMs are attractive – few-shot learning, text generation, conversational systems, etc. – we see tremendous value in having developed the first LLM focused on the financial domain,” added Bloomberg’s CTO Shawn Edwards. “BloombergGPT will enable us to tackle many new types of applications, while it delivers much higher performance out-of-the-box than custom models for each application, at a faster time-to-market.”
What is the implication of this trend?
Dr Peter van der Putten, director of the AI Lab at Pegasystems: “Until recently, the race in large language models was all about who can build the largest models. Earlier this month Huawei published PanGu-Σ, a 1.08 trillion parameter model… But there is another AI trend that has been rising to prominence – that smaller is beautiful.
“It started around a month ago when Meta published LLaMA… a sequence of small models quickly followed – all derived from LLaMA – going by colorful names such as Alpaca, Alpaca-LoRA, and ColossalChat… As LLaMA is restricted to non-commercial use, the same goes for these other models, but there have been other more open releases such as OpenChatKit [a set of models trained on the OIG-43M training dataset; a collaboration between Together, LAION, and Ontocord.ai] and Dolly, with the latter only a very thin training veneer on GPT-J.
"With a small data science team and some hundreds to thousands in compute you can self-host and finetune models like these towards your use cases as for many, pretraining from scratch is still too costly," he said.
"It may be worth it if for example your data is sensitive and/or proprietary, as you won’t have to call external services. That said, for quite a few use cases it will be acceptable to share prompts with a central service, and cloud vendors will allow clients to finetune base foundation models, without sharing the fine-tuning data with the base models. And to simplify further, many use cases can be achieved through clever automated prompt engineering without any fine-tuning.
He added: “For instance, we [Pegasystems] built a closed domain chatbot that can answer all kinds of questions about our software, based on our product documentation, without any tuning or training. It even passed many of the certifications for our product courses. Ultimately, the value is in creating specific apps for specific use cases."
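As a rough illustration of that last point, the sketch below answers questions by stuffing the most relevant documentation snippets into the prompt of a general-purpose LLM, with no fine-tuning at all. The retrieval is a naive keyword-overlap score, the sample documentation strings are invented, and call_llm() is a hypothetical placeholder for whichever hosted or self-hosted model an organisation uses; it is not a description of Pegasystems’ implementation.

```python
# Toy "prompt engineering without fine-tuning" sketch: retrieve the most
# relevant documentation passages and place them in the prompt. The sample
# docs are invented and call_llm() is a hypothetical stand-in for any LLM API.
def overlap_score(question: str, passage: str) -> int:
    """Crude relevance score: number of question words appearing in the passage."""
    q_words = set(question.lower().split())
    return sum(1 for word in passage.lower().split() if word in q_words)

def build_prompt(question: str, docs: list[str], top_k: int = 3) -> str:
    """Assemble a closed-domain prompt from the top_k most relevant passages."""
    ranked = sorted(docs, key=lambda d: overlap_score(question, d), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return ("Answer the question using only the documentation below.\n\n"
            f"Documentation:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

docs = [
    "To reset a workflow, open the admin console and choose Reset Workflow.",
    "Case types define the stages a piece of work moves through.",
]
print(build_prompt("How do I reset a workflow?", docs))
# response = call_llm(build_prompt(question, docs))   # call_llm() is hypothetical
```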
Alright, explain the “parameters” thing again please?
The word “parameters” is bandied about a lot for LLMs. What does it really refer to?
We asked Victor Botev, CTO and co-founder of Iris AI (an award-winning startup that provides an AI engine for scientific text understanding along with a modular software suite) for his explainer. Here’s how he put it.
‘Parameters’ are mathematical coefficients in a machine learning model that the model learns on its own from historical training data. In NLP, they represent the likelihood of a certain feature of text appearing, in what order, using which characters, and so on. By setting your parameters to different levels, you can fine-tune a model’s underlying structure to better fit the data and provide more accurate responses. Think of each parameter as a slider on a big audio mixing desk – except, in this case, there may be billions of sliders.
Each parameter tells the model how likely it is for a response to a given prompt to use specific punctuation, numbers, special characters, nouns, verbs, adjectives, and other features of text, and how frequently it will do so. The number of parameters has historically been used as a way to track the skill of a language model at a particular task. Indeed, parameter counts are often used as a competitive measure: “Our model has 100 billion parameters, whereas our competitor’s model only has 25 billion, and so our model is better at such-and-such task.”
When it’s said that a model has ‘X billion parameters’, what that really means is that, every time you submit a prompt, all ‘X billion’ parameters are used as it generates a response. Ideally, this means you get a better, more fine-tuned response. Some companies adopt the view that ‘the bigger, the better’ and treat the relationship between a model’s parameter count and its ability to generate text as linear.
However, this misses a critical nuance. The more parameters a model has, the better it is at certain tasks – and generating text is by no means a single task. It contains many different parts, such as sentence boundary disambiguation, part-of-speech tagging, and word sense disambiguation, not to mention factual validation. An extremely high number of parameters may give a large language model the context and training to generate plausible, ‘human-sounding’ text in response to most prompts. That’s because it can predict with high accuracy what a human’s response would look like – but quantity does not equal quality.
The chances of these responses containing factual errors remain high unless there is ample high-quality data to train those parameters. Because more parameters mean you need more examples, you consequently need to add more context to your training data to ensure the model’s abstractions don’t become unmoored from the exact facts they need to produce.
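To make the ‘sliders’ metaphor concrete, the short sketch below builds a toy next-token model in PyTorch and counts its parameters; every weight and bias it reports is one of the values adjusted during training. The architecture and sizes are arbitrary assumptions for illustration.

```python
import torch.nn as nn

# Toy next-token model: each weight and bias below is one "parameter",
# i.e. one slider the training process adjusts. Sizes are arbitrary.
vocab_size, hidden = 50_000, 256
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden),  # lookup table mapping tokens to vectors
    nn.Linear(hidden, 1024),
    nn.ReLU(),
    nn.Linear(1024, vocab_size),       # scores over the vocabulary for the next token
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} trainable parameters")  # frontier LLMs have billions of these
```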