Summarizing website content using OpenAI's GPT-3 language model

I wanted to use the newly released GPT-3 language model to summarize medical article content from various pages for indexing. Because I'm not fully comfortable programming in Python and I'm no longer a developer, I decided to have GPT-3 write the code for me. As a trial, I asked the GPT-3 model to write the Python code for a test in two parts:

  1. Scrape a web page's content into a variable.

  2. Send a string into the GPT-3 API and ask it to summarize.


I used Google Colab to execute and test the code. Surprisingly, after I copied the code that GPT-3's website produced for the first prompt into Google Colab and ran it, it executed as expected. The code scrapes the website, finds everything within <p> tags, and prints out the resulting string. The code from the second prompt also ran in Google Colab without correction, using a basic test string. I then manually combined the two into one script that takes any URL and tries to summarize it. It didn't need much adjustment.

If you care about the code, here it is:

# Install openai to execute the call to GPT-3
!pip install openai

# Import the packages we need for GPT-3, scraping the website, and the data model for sending/receiving
import requests
from bs4 import BeautifulSoup
import json
import openai

# Set my GPT-3 API key
openai.api_key = "<<GET THIS FROM YOUR OWN OPENAI ACCOUNT>>"

# Set a test URL
url = "https://www.jadler.info"

# Get the site content and parse the HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Insert text for what I want the GPT-3 language model to do at the beginning of the string
curratedresponse = "Summarize this article: "

# Combine all of the p tags from the scraped page and add them to the string with the prompt
for p in soup.find_all("p"):
    curratedresponse = curratedresponse + p.text

# Print the prompt plus the full article text just to see what it scraped
print("FULL TEXT:")
print(curratedresponse)

# Send the prompt and text to OpenAI's GPT-3 language model using their most advanced language model, "text-davinci-002"
summary = openai.Completion.create(
    engine="text-davinci-002",
    prompt=curratedresponse,
    temperature=0.9,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)

# Print the summary
print("SUMMARY:")
print(summary['choices'][0]['text'])



For the homepage of my personal website, GPT-3 responded with the following summary:


"SUMMARY:

Jake Adler is an entrepreneurial technologist with over fifteen years of hands-on experience managing long-term technology roadmaps, leading cross functional teams, and creating best in class digital experiences. He believes that the primary driver behind perpetually increasing human standard of living is technology. He has dedicated his career to practically and creatively applying modern technology solutions to real world problems."


That worked as expected. Now we can try it on something with a real-world use case. ASSH has an open-access journal that is hosted by a scientific publishing organization. I could scrape the article content and summarize it for our search. I decided to try https://www.jhsgo.org/article/S2589-5141(22)00055-X/fulltext.


I first noticed that the article content wasn't within <p> tags, so based on the structure of the page I adjusted the scraping code to the following:

# Combine all of the matching elements from the scraped page and add them to the string with the prompt
# (changed to find the class "section-paragraph" instead of <p> tags)
for p in soup.find_all(class_="section-paragraph"):
    curratedresponse = curratedresponse + p.text
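
The two scraping variants (all <p> tags, or all elements with a given class) could be folded into one helper so the selector is a parameter rather than an edit to the loop. This is only a sketch under stated assumptions: the function name extract_article_text, its css_class parameter, and the sample HTML are made up for illustration and are not part of the original script. Unlike the loop above, it joins the pieces with a space so paragraphs do not run together.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, css_class=None):
    """Return the text of the matching elements, joined by single spaces.

    css_class is a hypothetical parameter: when given, elements carrying
    that class are used; otherwise the function falls back to <p> tags.
    """
    soup = BeautifulSoup(html, "html.parser")
    if css_class is not None:
        elements = soup.find_all(class_=css_class)
    else:
        elements = soup.find_all("p")
    return " ".join(el.get_text(strip=True) for el in elements)


# Made-up HTML standing in for a downloaded page (an assumption for the demo)
sample_html = """
<div class="section-paragraph">First paragraph.</div>
<div class="section-paragraph">Second paragraph.</div>
<p>Footer text.</p>
"""

print(extract_article_text(sample_html, css_class="section-paragraph"))
print(extract_article_text(sample_html))
```

In the script above, the html argument would be response.text, and the returned string would be appended to the prompt.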


I then noticed that the summary was overcomplicated, so I decided to change the prompt. The right prompt will depend heavily on the use case.

# Insert text for what I want the GPT-3 language model to do at the beginning of the string
curratedresponse = "Summarize this article for a child: "


Here is the summary for that article:

SUMMARY:

The article is about a surgeon who has refined his technique over 37 years and shares the most effective things he says to patients during surgery to decrease complications and improve outcomes. He has found that speaking to patients while they are awake during surgery is the best opportunity for patients to receive and remember instructions to improve their post-operative healing process.


If I change the prompt to "Summarize this article for a high school student: ", here is the summary:

SUMMARY:

The intraoperative conversation is an important part of a surgeon’s job, and this article provides step-by-step things that a surgeon can say to a wide-awake patient during the surgery to improve outcomes in clinical practice.
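
Since the only thing that changed between these runs was the audience named in the prompt, the prompt prefix could be built by a small function instead of being edited by hand. The following is a minimal sketch, assuming a hypothetical helper named build_prompt. It reproduces the three prompt strings used in this post and places the article text after the colon.

```python
def build_prompt(article_text, audience=None):
    """Build the summarization prompt, optionally naming a target audience.

    build_prompt and its audience parameter are hypothetical names used
    only for this illustration.
    """
    instruction = "Summarize this article"
    if audience:
        instruction += " for " + audience
    return instruction + ": " + article_text


# Placeholder text standing in for the scraped article
for audience in (None, "a child", "a high school student"):
    print(build_prompt("ARTICLE TEXT GOES HERE", audience))
```

The returned string would be passed as the prompt argument in the API call shown earlier.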


With barely any coding, I was able to get comprehensible and accurate summaries of a scientific article.