Unlocking the Potential of AI: How Databricks Dolly is Democratizing LLMs

As the world continues to generate massive amounts of data, artificial intelligence (AI) is becoming increasingly important in helping businesses and organizations make sense of it all. One of the biggest challenges in AI development is building large language models that can process and analyze vast amounts of text. That’s where Databricks Dolly comes in. This project from Databricks aims to change the way language models are developed and deployed by making them open and affordable to build. In this article, we’ll dive deeper into what makes Databricks Dolly special and explore the impact it could have on the future of AI.

What is a Large Language Model?

Before going further into Databricks Dolly, let’s understand what a Large Language Model (LLM) is.

A large language model is an artificial intelligence system designed to understand and generate human-like language and to perform natural language processing (NLP) tasks. These models are built using deep learning techniques, typically a neural network architecture, and are trained on large amounts of text data.

The main goal of large language models is to be able to generate coherent and relevant text based on a given prompt or input. This has a wide range of applications, including language translation, chatbots, speech recognition, and even creative writing.

One of the most well-known examples of a large language model is GPT-4 (Generative Pre-trained Transformer 4), which was released by OpenAI in 2023. GPT-4 has been trained on an enormous corpus of text data and is capable of generating highly coherent and contextually appropriate text in response to a wide range of prompts and queries.

LLMs have gained significant popularity due to their broad applicability to natural language processing (NLP) tasks. These tasks include:

  1. Text generation, which allows LLMs to generate text on any topic they have been trained on.
  2. Translation, where LLMs trained in multiple languages can translate from one language to another.
  3. Summarizing blocks or multiple pages of text, as well as rewriting a section of text.
  4. Classification and categorization of content, including sentiment analysis, to help users understand the intent of a piece of content or response.
  5. Conversational AI and chatbots, which enable more natural and fluid conversations with users than older generations of AI technologies.

Databricks Dolly builds on recent advances in large language model products such as ChatGPT, Google Bard, and Bing Chat, and aims to create a unified platform for training, fine-tuning, and deploying such models at scale. But do you know how these models work or how they generate answers for us? Do you know where your data is stored after you submit a query and receive an answer?

Let’s take the example of the most popular LLM, ChatGPT. ChatGPT operates similarly to other LLMs, but with a key difference from a model you host yourself: your prompts and data are sent to and stored on servers operated by a third party, which creates a risk that sensitive information could be exposed or compromised. Additionally, ChatGPT may generate offensive, discriminatory, or biased content, and it may produce irrelevant or incorrect information that could lead users to make wrong decisions.

As security concerns around chatbots continue to grow, the demand for open-source alternatives that prioritize accuracy, security, and impartiality has increased. Addressing these requirements, the Databricks team conducted a thorough analysis of user concerns and developed Databricks Dolly – an open-source chatbot that upholds these principles and exhibits exceptional performance in various use cases.

Difficulties in Adopting Large Language Models

There are certain challenges to implementing LLMs in an enterprise.

| Challenge | Why is it a challenge? |
| --- | --- |
| The need for speed | Your rivals are also working to develop LLMs, and you must stay ahead of the competition. You need to focus on high-value use cases, but how can you achieve this goal quickly and efficiently? |
| Customization, control, and security of LLMs | Employing proprietary SaaS LLMs compels you to disclose your data to third parties, leaving you without a competitive advantage. The issue at hand is how to customize an LLM that you own and control, using your proprietary data. |
| Integrating LLMs with existing data | Like other machine learning methods, LLMs need to be closely tied to your existing data strategy. The difficulty lies in determining the best approach to connect LLMs with your current data infrastructure. |

Proprietary SaaS LLMs vs Open Source LLMs

Let’s compare and contrast proprietary SaaS LLMs and open-source LLMs. This will help you understand which one suits your needs better, and why.

| Feature | Proprietary SaaS LLMs | Open Source LLMs |
| --- | --- | --- |
| Control/ownership | Controlled and owned by vendors. | Completely open: you have full ownership over customizations. |
| Security/privacy | Data exits your Databricks environment. | Data remains within your Databricks environment. |
| Customization | Dependent on vendors. | Completely customizable. |
| Transparency | The way the model works and the data used may not be clear. | Source code, model weights, and training data are fully open and available. |
| How is it accessed? | Access is through an API and relies on either self-hosting or third-party vendor hosting with an SLA. | Can be self-hosted or hosted by a vendor. |
| Cost/quality | There is no single “best” model for every use case, as it varies. | There is no single “best” model for every use case, as it varies. |

How can Databricks address these challenges?

Databricks provides unique advantages for implementing LLMs:

  1. LLMs can be easily accessed through interactive SQL, Delta Live Tables, and real-time APIs. You can also still use LLMs from a Python IDE or notebook.
  2. Databricks supports both proprietary SaaS and open-source LLMs. With easy-to-use tools, you can train and fine-tune open-source LLMs to achieve the ideal balance of quality, control, and customization for each use case.
  3. Databricks provides secure integration with your enterprise data in the Lakehouse, eliminating the need to copy your data to another vendor or service. Additionally, the Databricks platform offers a unified experience with model serving, feature store, MLOps (LLMOps), and data monitoring capabilities.

What is Databricks Dolly?

Databricks Dolly is a large language model (LLM) that can follow natural language instructions and generate text for tasks such as brainstorming, summarization, and question answering. The current version is based on an open-source 12 billion parameter model from EleutherAI and fine-tuned on a human-generated instruction dataset called databricks-dolly-15k. Dolly is open-source and licensed for commercial use, which means anyone can use it to create interactive applications without paying for API access or sharing data with third parties. Dolly is also cost-effective to build: according to Databricks, the first version was trained for less than $30.

When you generate an answer using Dolly, the data is stored in the DBFS root or other cloud object storage location that you configure. You can create, own, and customize your own LLM using Dolly without sharing data with third parties.

Figure: How Databricks Dolly works

Some of the benefits of Databricks Dolly are:

  • Databricks Dolly provides users with greater transparency and control over their data. The chatbot is built on top of a secure and reliable platform, and users can review and modify the code to ensure that their data is being handled securely.
  • As an open-source project, Databricks Dolly is available to the public for free, and users are encouraged to contribute to its development and improvement. This also means that anyone can use it to create interactive applications without paying for API access or sharing data with third parties.
  • It is based on an open-source 12 billion parameter model from EleutherAI, which is much smaller than many other LLMs (OpenAI has not disclosed the size of the models behind ChatGPT, but its predecessor GPT-3 has 175 billion parameters), yet it can still exhibit surprising instruction-following capabilities.
  • It is trained on the Databricks machine learning platform, which provides a fast, simple, and scalable way to build and run large language models.

Evolution of Dolly

At the time of writing, Dolly is at version 2. When Dolly was first released, however, it had certain limitations. Let’s understand its evolution:

Dolly 1.0

Dolly 1.0 is based on an open-source 6 billion parameter model from EleutherAI and fine-tuned on a dataset of ~50k instruction/response pairs that were created by Stanford researchers using the OpenAI API. Dolly 1.0 was released in March 2023, but it was not licensed for commercial use, as the dataset contained output from ChatGPT, which is subject to OpenAI’s terms of service.

Dolly 2.0

Dolly 2.0 is based on an open-source 12 billion parameter model from EleutherAI and fine-tuned on a human-generated instruction dataset called databricks-dolly-15k. Dolly 2.0 was released in April 2023, and it is licensed for both research and commercial use, which means anyone can use it to create interactive applications without paying for API access or sharing data with third parties.

A word of caution while using ChatGPT

As a language model, ChatGPT has the ability to process and generate text based on the input provided to it. Depending on the input, it is possible that ChatGPT may generate text that contains sensitive information, such as personal details, financial information, or confidential business information. If the generated text contains such sensitive information and is not handled properly, there is a risk that it may be exposed to unauthorized individuals, such as other users of the chatbot, or hackers who may attempt to exploit any vulnerabilities in the system. This could potentially lead to the leakage of sensitive information about a particular organization or individual, which could cause significant harm to the affected parties. Therefore, it is important to handle sensitive information with care and to implement appropriate security measures to protect against unauthorized access or data breaches.
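One simple precaution is to strip obviously sensitive values from a prompt before it leaves your environment. The sketch below is only an illustration of that idea: the two regular expressions and the redact helper are hypothetical, cover just email addresses and card-like digit runs, and are no substitute for a proper data-loss-prevention tool.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = CARD_RE.sub("[NUMBER]", text)
    return text


prompt = "Email jane.doe@example.com about card 4111 1111 1111 1111."
print(redact(prompt))
```

Running the redaction before every call to an external chatbot keeps the raw values inside your own systems, at the cost of occasionally hiding context the model might have needed.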

The Difference between Dolly and ChatGPT

The table below illustrates the differences between Databricks Dolly and ChatGPT.

| Factor | Databricks Dolly 2.0 | ChatGPT (GPT-4) |
| --- | --- | --- |
| Language model | EleutherAI Pythia model family. | Family of models that are part of OpenAI's GPT series. |
| Parameters | dolly-v2-12b is a 12 billion parameter model. Smaller versions are also available: dolly-v2-7b (6.9 billion parameters, based on pythia-6.9b) and dolly-v2-3b (2.8 billion parameters, based on pythia-2.8b). | Not publicly disclosed by OpenAI. |
| Security | Can be deployed inside an organization's own security perimeter, so there is less chance of data being exposed externally. | A data breach is possible if data is not handled properly. |
| Pricing | Free. | Free, with paid offerings also available. |
| Open source | Yes | No |

Types of Instructions Dolly Can Handle

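Databricks Dolly can handle a number of tasks. These correspond to the instruction categories in the databricks-dolly-15k training dataset, which is why several of them are described below in terms of what the dataset's annotators were asked to do: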

  1. Open Q&A: This task involves asking a question that may not have a clear answer or may require drawing on knowledge from a broad range of topics. For example, “Why do people enjoy music?” or “What is the meaning of life?”.
  2. Closed Q&A: In contrast to open Q&A, this task involves asking a question that can be answered using only the information contained in a given passage. For example, given a paragraph about the history of the Eiffel Tower, a closed question could be “When was the Eiffel Tower built?”.
  3. Extract information from Wikipedia: In this task, annotators are asked to copy a paragraph from Wikipedia and extract specific pieces of factual information, such as the date of a historical event or the name of a notable person mentioned in the passage. For example, given a paragraph about the American Civil War, annotators could be asked to extract the date of the Battle of Gettysburg.
  4. Summarize information from Wikipedia: Similar to the previous task, annotators are given a paragraph from Wikipedia, but are instead asked to summarize it into a shorter passage. For example, given a paragraph about the history of jazz music, annotators could be asked to summarize it into a few sentences describing the origins of the genre. Wikipedia is just an example: you can apply the same summarization capability to other data, including practically any dataset in the lakehouse.
  5. Brainstorming: This task involves asking annotators to generate a list of possible options or ideas for a given topic. For example, “What are some ways to reduce plastic waste?” or “What are some fun activities to do with kids?”.
  6. Classification: In this task, annotators are asked to make judgments about the class membership of a given item or passage. For example, given a list of animals, annotators could be asked to classify each animal as a mammal, bird, reptile, etc. Alternatively, given a movie review, annotators could be asked to classify the sentiment of the review as positive, negative, or neutral.
  7. Creative writing: Finally, this task involves asking annotators to write a piece of creative writing, such as a poem, a short story, or a love letter. For example, annotators could be asked to write a short story about a character who wakes up one morning to find that they have been transported to a magical land.
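To make the closed Q&A case concrete, here is a minimal sketch of how a passage and a question might be combined into a single instruction string. The build_closed_qa_prompt helper and its wording are assumptions made for illustration; Dolly does not require this exact layout.

```python
def build_closed_qa_prompt(passage: str, question: str) -> str:
    """Combine a passage and a question into one instruction string."""
    return (
        "Answer the question using only the passage below.\n\n"
        f"Passage: {passage.strip()}\n\n"
        f"Question: {question.strip()}"
    )


passage = "The Eiffel Tower was built between 1887 and 1889 for the 1889 World's Fair."
prompt = build_closed_qa_prompt(passage, "When was the Eiffel Tower built?")
print(prompt)
```

The resulting string can be passed to the text generation pipeline like any other instruction. Keeping the passage and the question clearly labeled makes it easier for the model to stay within the supplied text.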

How to use Databricks Dolly?

To use Dolly, you can copy the code below and run it. Let’s walk through it step by step:

STEP 1: Ensure that both the transformers and accelerate libraries are installed. Use the following command to do so:

%pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"

The command installs the following packages:

  1. accelerate>=0.16.0,<1: This package provides tools for running and training machine learning models across the available hardware, such as one or more GPUs. The >=0.16.0 specifies the minimum version required, and <1 excludes version 1 and anything later.
  2. transformers[torch]>=4.28.1,<5: This package provides a collection of pre-trained models for NLP tasks, including those based on the GPT architecture. The [torch] extra installs the additional dependencies needed to use the library with PyTorch. The >=4.28.1 specifies the minimum version required, and <5 excludes version 5 and anything later.
  3. torch>=1.13.1,<2: This package is the PyTorch library, which is used for building and training deep learning models. The >=1.13.1 specifies the minimum version required, and <2 excludes version 2 and anything later.
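If you want to confirm that an environment matches these ranges, a check along the following lines can help. This is a rough sketch: parse_version and in_range are hypothetical helpers that handle only simple dotted version numbers, and the packaging library is the more robust choice for real version comparisons.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '4.28.1' into (4, 28, 1); drops local suffixes such as '+cu117'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad so '5' compares like '5.0.0'
        parts.append(0)
    return tuple(parts)


def in_range(version: str, minimum: str, below: str) -> bool:
    """True when minimum <= version < below."""
    return parse_version(minimum) <= parse_version(version) < parse_version(below)


# Ranges copied from the pip command above.
REQUIREMENTS = {
    "accelerate": ("0.16.0", "1"),
    "transformers": ("4.28.1", "5"),
    "torch": ("1.13.1", "2"),
}

for name, (minimum, below) in REQUIREMENTS.items():
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "ok" if in_range(installed, minimum, below) else "outside range"
    print(f"{name} {installed}: {status}")
```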

STEP 2: Load the model by creating a text generation pipeline. Once the model is loaded, you can use the generate_text function to generate text based on the input you provide.

# Import PyTorch, the deep-learning framework used to run the model.
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

Let’s understand the code. The first line imports the PyTorch library, which is a popular deep-learning framework used for building and training machine-learning models, including models for natural language processing (NLP).

The second line imports the pipeline function from the Transformers library, which is a high-level interface for interacting with pre-trained models and performing various NLP tasks, such as text generation, sentiment analysis, and named entity recognition.

The third line defines a variable, generate_text, that holds a pipeline for generating text with the Databricks Dolly model. The model parameter specifies the name of the pre-trained model to use, which in this case is "databricks/dolly-v2-12b". The torch_dtype parameter specifies the data type of the PyTorch tensors used in the pipeline, which in this case is torch.bfloat16, a lower-precision floating-point format that reduces memory use and can speed up computation on hardware that supports it.

The trust_remote_code parameter specifies whether to allow downloading and executing custom code from the model's repository on the Hugging Face Hub (in this case, the pipeline code that Databricks published alongside the model). The device_map parameter controls where the model is placed. Setting it to "auto" lets the Accelerate library distribute the model across the available devices based on your hardware.
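The choice of data type matters because it largely determines how much memory the model's weights need: roughly the number of parameters multiplied by the bytes per parameter. The sketch below applies that arithmetic to the three Dolly 2.0 sizes. It is an estimate of weight storage only, since activations and generation caches need extra memory on top, and the largest_model_that_fits helper is hypothetical. The "databricks/" prefix on the two smaller model names is assumed by analogy with the 12b model.

```python
# Parameter counts for each Dolly 2.0 variant.
MODEL_PARAMS = {
    "databricks/dolly-v2-12b": 12e9,
    "databricks/dolly-v2-7b": 6.9e9,
    "databricks/dolly-v2-3b": 2.8e9,
}
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def largest_model_that_fits(available_gb: float, dtype: str = "bfloat16"):
    """Return the biggest variant whose weights fit, or None if none do."""
    candidates = sorted(MODEL_PARAMS.items(), key=lambda kv: kv[1], reverse=True)
    for name, params in candidates:
        if weight_memory_gb(params, dtype) <= available_gb:
            return name
    return None


for name, params in MODEL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(params, 'bfloat16'):.1f} GB in bfloat16")
```

By this estimate the 12 billion parameter model needs about 24 GB for its weights in bfloat16 and about twice that in float32, which is one reason the smaller variants exist.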

STEP 3: You can use the pipeline to answer instructions. In the example given, the input is “Explain to me the difference between nuclear fission and fusion.” Let’s understand the code:

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

The generate_text function takes the prompt as input and uses the language model to generate a text response. The generated response is stored in the res variable.

The second line of the code prints the generated text by accessing the generated_text field of the first item in the res list.
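The result is a list of dictionaries, each with a generated_text field. The sketch below wraps that access pattern in a small helper. To keep the example runnable without downloading the model, it uses a stand-in function, fake_generate_text, that only imitates the shape of the result; both it and the ask helper are hypothetical.

```python
def fake_generate_text(prompt: str) -> list:
    """Stand-in for the real pipeline; returns the same list-of-dicts shape."""
    return [{"generated_text": f"(model output for: {prompt})"}]


def ask(generate, prompt: str) -> str:
    """Call a pipeline-like function and return the first generated text."""
    results = generate(prompt)
    if not results:
        raise ValueError("pipeline returned no results")
    return results[0]["generated_text"].strip()


print(ask(fake_generate_text, "Explain to me the difference between nuclear fission and fusion."))
```

With the real model loaded, you would pass generate_text in place of fake_generate_text.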

STEP 4: Alternatively, if you do not want to use trust_remote_code=True, you can download instruct_pipeline.py and construct the pipeline yourself from the loaded model and tokenizer. The tokenizer and model can be loaded using the AutoTokenizer and AutoModelForCausalLM classes from the Transformers library. The constructed pipeline can then be used to generate text in the same way as before.

import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)

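The code loads the pre-trained Databricks Dolly v2-12B model and tokenizer using the AutoModelForCausalLM and AutoTokenizer classes, respectively, from the Transformers library. The padding_side parameter is set to “left” so that input prompts are padded on the left. This keeps the end of each prompt directly next to the tokens the model generates, which matters when prompts of different lengths are processed together in a batch.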

Then, the InstructionTextGenerationPipeline class is used to create a new text generation pipeline, which takes in the pre-trained model and tokenizer as arguments. This pipeline will allow the user to provide instructions for generating text, which the model will use to generate text that is relevant to the instructions.
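Conceptually, an instruction pipeline like this wraps your instruction in a fixed prompt template before passing it to the model, then returns only the text that follows the response marker. The sketch below illustrates that idea. The template wording and marker strings are assumptions modeled on the instruction-following format, so check instruct_pipeline.py for the exact text it uses; format_prompt and extract_response are hypothetical helpers, not functions from that file.

```python
# Assumed template text; verify against instruct_pipeline.py before relying on it.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"


def format_prompt(instruction: str) -> str:
    """Wrap a plain instruction in the template the model was fine-tuned on."""
    return f"{INTRO}\n\n{INSTRUCTION_KEY}\n{instruction}\n\n{RESPONSE_KEY}\n"


def extract_response(full_text: str) -> str:
    """Keep only what follows the response marker (empty if the marker is missing)."""
    _, _, tail = full_text.partition(RESPONSE_KEY)
    return tail.strip()


prompt = format_prompt("Explain to me the difference between nuclear fission and fusion.")
print(prompt)
```

Because the pipeline handles this wrapping for you, you normally pass only the bare instruction. The sketch is useful mainly for understanding why a fine-tuned model responds better to instructions than its base model does.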

Tips and Best Practices for Using Dolly Effectively

Databricks Dolly is simple to use, but there are still several things to keep in mind. In particular, the way you phrase an instruction affects how accurate the results are and how well they meet your expectations. Below are some best practices and tips for using Databricks Dolly effectively.

1. Be clear and specific in the instructions: Dolly can follow natural language instructions, but it may not understand vague, ambiguous, or incomplete ones. Try to provide enough details and context for Dolly to generate a relevant and coherent response. For example, instead of asking “Write a poem”, you can ask “Write a short poem about spring”.

2. Use the Databricks machine learning platform to train and deploy Dolly: Even though you can set up Databricks Dolly on your local machine, it is recommended to deploy Dolly on Databricks, since it provides a fast, simple, and scalable way to build and manage data and ML pipelines. You can use the Databricks platform to fine-tune Dolly on your own data, integrate it with other tools and services, and monitor its performance and usage.

3. To interact with Dolly, use the Hugging Face Transformers library: Transformers is a popular library for natural language processing that supports many models and frameworks. You can use it to load Dolly from the model hub, generate text with the pipeline function, or customize the model with the Trainer class.

4. Use the examples and tutorials provided in the dolly repo: The dolly repo contains many examples and tutorials that show how to use Dolly for different tasks and domains. You can use these resources to learn and to find ideas for your own projects.

These best practices apply equally to other LLMs.


This article highlights the challenges organizations face in adopting Large Language Models (LLMs) and compares the benefits of using open-source LLMs versus proprietary SaaS LLMs. It also explains how Databricks, a unified data and AI platform, can address these challenges and provides insights into its generative AI large language model, Dolly. With Dolly, organizations can develop a personalized LLM that suits their specific needs, with the added advantage that its source code is open. The article also emphasizes the importance of best practices for getting the most out of an LLM and offers tips for using Dolly effectively. Finally, it shows how Databricks Dolly democratizes LLMs by letting organizations build and fine-tune their own models to fit their unique requirements.
