Are you tired of dealing with complex code and confusing commands when working with Apache Spark? Well, get ready to say goodbye to all that hassle! The English SDK for Spark is here to save the day.
With the English SDK, you don’t need to be a coding expert anymore. Say farewell to the technical jargon and endless configurations. Instead, use simple English instructions to communicate with Apache Spark.
In this article, we’ll explore how the English SDK makes data processing a breeze. We’ll learn how it generates code effortlessly and provide a step-by-step guide to using the English SDK for Spark. Get ready for a smoother, more straightforward data analysis experience!
Get excited as we take you on a journey through the world of the English SDK for Spark. Say hello to a new era of simplicity and accessibility in data analysis.
Before we begin, it’s essential to understand why we are using the English SDK for Apache Spark and what it can achieve. Let’s dive right in and not waste any time.
What is the English SDK?
The English SDK for Apache Spark is a tool that allows you to write Spark applications in plain English. It takes English instructions and compiles them into PySpark objects like DataFrames. This makes Spark more user-friendly and accessible, allowing you to focus your efforts on extracting insights from your data.
The English SDK is based on OpenAI’s GPT language model, which has been trained on extensive amounts of text and code. With this model, the English SDK is capable of comprehending and responding to a diverse range of English commands. For instance, you can employ the English SDK to:
- Perform data ingestion: The SDK can perform a web search using your provided description, utilize the LLM to determine the most appropriate result, and then smoothly incorporate this chosen web data into Spark—all accomplished in a single step.
- Create DataFrames: The SDK can create DataFrames from a variety of sources, including CSV files, JSON files, and web APIs.
- Transform DataFrames: The SDK can perform a variety of transformations on DataFrames, such as filtering, sorting, and aggregation. It can also work with Python pandas DataFrames.
- Visualize data: The SDK can create visualizations of your data using libraries like Plotly, Matplotlib, and Seaborn.
How does the English SDK work?
The diagram below illustrates the process flow of the English SDK: it uses a compiler to transform English instructions into bytecode, which is subsequently executed by the Apache Spark engine to carry out various operations, including DataFrame generation and DataFrame filtering.
- The “Source Code in English” represents the program written in plain English instead of the complex constructs of the PySpark API.
- The “Generative AI” step uses OpenAI’s GPT-3.5 or GPT-4 model to process this source code and convert it into PySpark code.
- The generated “PySpark” code is compiled and then fed into the “Apache Spark Engine.” The engine interprets and executes it, performing various operations, including DataFrame generation, filtering, and other manipulations based on the provided English instructions.
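To make this flow concrete, here is a minimal, purely illustrative sketch of the idea: an English instruction goes in, the LLM returns PySpark code as text, and that code is then executed against a DataFrame. The instruction and the “generated” snippet below are hand-written assumptions for illustration, not actual SDK output.

```python
# Purely illustrative sketch of the English SDK's flow. The "generated"
# PySpark below is a hand-written assumption of what the LLM might emit
# for the given instruction -- it is not real SDK output.

english_instruction = "keep only brands that sold more than 100000 units"

# Step 1: the instruction is sent to the LLM (stubbed out here).
def fake_llm(instruction: str) -> str:
    # A real call would prompt GPT-3.5/GPT-4; we return a canned answer.
    return "df.filter(df['units_sold'] > 100000)"

# Step 2: the LLM returns PySpark source code as plain text.
generated_code = fake_llm(english_instruction)

# Step 3: the SDK executes that code against the target DataFrame
# (omitted here; the Apache Spark engine would interpret and run it).
print("Instruction:      ", english_instruction)
print("Generated PySpark:", generated_code)
```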
The English SDK is still in the early stages of development, but it has the potential to revolutionize the way we interact with data. By making Spark more accessible to non-technical users, the English SDK can help to democratize data engineering and data science and make it more accessible to everyone.
How to use the English SDK for Apache Spark?
To make the most of the English SDK’s capabilities, follow these important steps. Let’s explore them in detail.
Create OpenAI API Key
Follow the below steps to create an OpenAI API Key:
- Go to the OpenAI API page and log into your account.
- Click on API.
- Go to your account and click “Visit API Key“.
- Now, click “+ Create new secret key“.
- Provide a name for the key.
- The OpenAI API key will be created; copy it right away.
- Press the Done button.
Important Note: After creating the secret key, be aware that you cannot view or edit it again, so copy it immediately.
Create Google API Key
Now, you have to create a Google API key. Follow the steps below:
- Go to the Google Cloud Console and navigate to the Credential page.
- Click on “Create Project”, provide a name for the project along with the organization’s location, and click “Create” to set up the new project.
- Select the project created above.
- Open the navigation menu.
- Click API & Services.

Next, the Custom Search service needs to be enabled. Follow the steps below:
Enable Custom Search Service and create API Key and Credential
- Select Enabled APIs & Services.
- Click “+ Enable APIs and Services“.
- Search for the Custom Search API.
- Click Enable.
- Click Create Credentials.
- Select Custom Search API from the drop-down if it is not already selected.
- Click Next.
- Provide the app name.
- Provide the developer’s email ID.
- Click Save and Continue.
- Set the application type to “Web application” from the drop-down.
- Provide the name as “web client”.
- Click Create.
- Click Done.
Now we will create an API key.
- Open the Credentials page from the navigation menu.
- Click the + sign and select API key.

This will create an API key.
Create Google Custom Search Engine ID
Here you have to create a Search Engine ID:
- Browse to Google CSE.
- Enter a name for the search engine, for example “google”.
- Now, enter http://www.google.com under “What to Search?“.
- Click Add.
- Complete the “I’m not a robot” verification.
- Now click Create.
Now copy the Google search engine ID as that will be used in the next step.
STEP 1: Install the English SDK
Install the English SDK using the following pip command:
```shell
%pip install pyspark-ai
%pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
%pip install langchain
%pip install openai
```
STEP 2: Import Necessary Modules.
```python
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI
import os
```
STEP 3: Set Environment variable for API Keys:
In this step, you have to import three API keys:
- OpenAI API Key
- Google API Key
- Google CSE ID
```python
os.environ['OPENAI_API_KEY'] = 'Your OpenAI API Key'
os.environ['GOOGLE_API_KEY'] = 'Your Google API Key'
os.environ['GOOGLE_CSE_ID'] = 'Your Custom Search Engine ID'
```
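Hard-coding secrets in a notebook is risky. As a small optional sketch, you can check that the keys the SDK and LangChain expect are present in the environment before initializing anything, and fail fast with a clear message if they are not. The helper name below is our own invention, not part of the SDK:

```python
import os

def check_api_keys(env=os.environ):
    """Return the names of any required API keys that are not set."""
    required = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "GOOGLE_CSE_ID")
    return [name for name in required if not env.get(name)]

# Example: report what is missing before initializing SparkAI,
# instead of getting a confusing error later on the first API call.
missing = check_api_keys()
if missing:
    print("Set these before continuing:", missing)
```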
STEP 4: Create an instance of the language model
```python
# temperature=0 makes the model's output deterministic, which is
# desirable when generating code
llm = ChatOpenAI(model_name='gpt-3.5-turbo-16k-0613', temperature=0)
```
STEP 5: Initialize and Activate SparkAI
```python
spark_ai = SparkAI(llm=llm, verbose=True)
spark_ai.activate()
```
STEP 6: Create DataFrame from a given URL using the language model
```python
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
```
STEP 7: Display the DataFrame
Here is a complete code snippet that enables you to obtain the desired output:
```python
# Import the necessary modules
from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI
import os

# Set the environment variables for the API keys
os.environ['OPENAI_API_KEY'] = 'Your OpenAI API Key'
os.environ['GOOGLE_API_KEY'] = 'Your Google API Key'
os.environ['GOOGLE_CSE_ID'] = 'Your Custom Search Engine ID'

# If 'gpt-4' is unavailable, use 'gpt-3.5-turbo' (might lower output quality)
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

# Initialize SparkAI with the ChatOpenAI model
spark_ai = SparkAI(llm=llm, verbose=True)

# Activate partial functions for Spark DataFrame
spark_ai.activate()

# Create a DataFrame from the web page and display it
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
display(auto_df)
```
Developers are encouraged to experiment with other LLMs (e.g., GPT-3.5 or GPT-4), which can be passed during the initialization of SparkAI instances to cater to various use cases.
English SDK Operations
Here are some commands you can try.
```python
# COMMAND ----------
auto_df2 = spark_ai.create_df("Top 10 tech companies by market cap")
auto_df2.show()

# COMMAND ----------
auto_df3 = spark_ai.create_df("2022 USA national auto sales by each brand")
auto_df3.show()

# COMMAND ----------
auto_df3.ai.verify("expect sales change percentage to be between -100 to 100")

# COMMAND ----------
auto_df3.ai.plot()

# COMMAND ----------
# Note: this command did not work and threw this error:
# ValueError: DataFrame constructor not properly called!
auto_df3.ai.plot("pie chart for US sales market shares, show the top 5 brands")

# COMMAND ----------
auto_top_growth_df = auto_df3.ai.transform("brand with the highest growth")
auto_top_growth_df.show()

# COMMAND ----------
# Explain what a DataFrame is retrieving.
auto_top_growth_df.ai.explain()

# COMMAND ----------
# You can also specify the expected columns for the ingestion.
df = spark_ai.create_df("USA presidents", ["president", "vice_president"])
df.show()

# COMMAND ----------
presidents_who_were_vp = df.ai.transform("presidents who were also vice presidents")
presidents_who_were_vp.show()

# COMMAND ----------
presidents_who_were_vp.ai.explain()

# COMMAND ----------
# Note: this command threw this error:
# AttributeError: 'DataFrame' object has no attribute 'isnull'
presidents_who_were_vp.ai.verify("expect no NULL values")

# COMMAND ----------
# Search and ingest web content into a DataFrame
company_df = spark_ai.create_df("Top 10 tech companies by market cap", ['company', 'cap', 'country'])
company_df.show()

# COMMAND ----------
us_company_df = company_df.ai.transform("companies in USA")
us_company_df.show()

# COMMAND ----------
us_company_df.ai.explain()

# COMMAND ----------
us_company_df.ai.verify("expect all company names to be unique")

# COMMAND ----------
best_albums_df = spark_ai.create_df('https://time.com/6235186/best-albums-2022/', ["album", "artist", "year"])
best_albums_df.show()

# COMMAND ----------
best_albums_df.ai.verify("expect each year to be 2022")

# COMMAND ----------
# This will create the UDF based on the plain English statement.
@spark_ai.udf
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs"""
    ...

# COMMAND ----------
# Testing the UDF
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.udf.register("convert_grades", convert_grades)
percentGrades = [(1, 97.8), (2, 72.3), (3, 81.2)]
df = spark.createDataFrame(percentGrades, ["student_id", "grade_percent"])
df.selectExpr("student_id", "convert_grades(grade_percent)").show()

# COMMAND ----------
# To keep it in the cache.
spark_ai.commit()
```
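In the UDF example above, the SDK asks the LLM to fill in the body of `convert_grades` from nothing but the docstring. For reference, here is one plausible hand-written equivalent using common US letter-grade cutoffs; the cutoffs the model actually generates may differ, so treat this as an assumption:

```python
# A hand-written stand-in for the body the LLM might generate for
# convert_grades. The cutoffs are common US letter-grade boundaries,
# assumed for illustration.
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs."""
    if grade_percent >= 90:
        return "A"
    if grade_percent >= 80:
        return "B"
    if grade_percent >= 70:
        return "C"
    if grade_percent >= 60:
        return "D"
    return "F"

# The sample rows from the test above would map as follows:
print(convert_grades(97.8))  # A
print(convert_grades(72.3))  # C
print(convert_grades(81.2))  # B
```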
The English SDK for Spark is still in development, so some features do not work yet; I have noted the errors I encountered in the examples above.
The English SDK for Apache Spark is a game-changer in the world of data processing. By enabling users to interact with Spark through simple English instructions, it eliminates the complexities of coding and empowers data analysts and developers alike to focus on extracting valuable insights from their data.