
An Introduction to Natural Language Processing with Python for SEOs

Want to prepare for the future of search? Learn practical natural language processing (NLP) while building a simple knowledge graph from scratch.


Natural language processing (NLP) is becoming more important than ever for SEO professionals.

It is crucial to start building the skills that will prepare you for all the amazing changes happening around us.

Hopefully, this column will motivate you to get started!

We are going to learn practical NLP while building a simple knowledge graph from scratch.

As Google, Bing, and other search engines use Knowledge Graphs to encode knowledge and enrich search results, what better way to learn about them than to build one?

Specifically, we are going to extract useful facts automatically from Search Engine Journal XML sitemaps.

In order to do this and keep things simple and fast, we will pull article headlines from the URLs in the XML sitemaps.

We will extract named entities and their relationships from the headlines.

Finally, we will build a powerful knowledge graph and visualize the most popular relationships.

In the example below the relationship is “launches.”

The way to read the graph is to follow the direction of the arrows: subject “launches” object.

For example:

  • “Bing launches 19 tracking map,” which is likely “Bing launches COVID-19 tracking map.”
  • Another is “Snapchat launches ads certification program.”

These facts and over a thousand more were extracted and grouped automatically!

Let’s get in on the fun.

Here is the technical plan:

  • We will fetch all Search Engine Journal XML sitemaps.
  • We will parse the URLs to extract the headlines from the slugs.
  • We will extract entity pairs from the headlines.
  • We will extract the corresponding relationships.
  • We will build a knowledge graph and create a simple form in Colab to visualize the relationships we are interested in.

Fetching All Search Engine Journal XML Sitemaps

I recently had an enlightening conversation with Elias Dabbas from The Media Supermarket and learned about his wonderful Python library for marketers: advertools.

Some of my old Search Engine Journal articles are not working with the newer library versions. He gave me a good idea.

If I print the versions of the third-party libraries now, it will be easy to get the code working again in the future.

Whenever the code breaks, I just need to reinstall the versions that worked. 🤓

%%capture
!pip install advertools

import advertools as adv
print(adv.__version__)

#0.10.6
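
If a future advertools release breaks something, reinstalling the version printed above should get things working again. For example:

# Pin the version this tutorial was written and tested with
!pip install advertools==0.10.6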

We are going to download all Search Engine Journal sitemaps to a pandas data frame with two lines of code.

sitemap_url = "https://www.searchenginejournal.com/sitemap_index.xml"
df = adv.sitemap_to_df(sitemap_url)

One cool feature of the package is that it downloads all the sitemaps linked in the index, so we get everything back in a single data frame.

Look how simple it is to filter the articles/pages published this year. We have 1,550 articles.

new_df = df[df["lastmod"] > '2020-01-01']

Extract Headlines From the URLs

The advertools library has a function that breaks URLs into their components right inside the data frame, but let’s do it manually to get familiar with the process.
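
For reference, the one-liner alternative would look something like this (url_to_df is the helper in recent advertools releases; confirm it exists in your installed version):

# Split every article URL into scheme, netloc, path, directories, etc.
url_parts = adv.url_to_df(new_df["url"].tolist())
url_parts.head()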

from urllib.parse import urlparse

import re
example_url="https://www.searchenginejournal.com/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"

u = urlparse(example_url)
print(u) #output -> ParseResult(scheme='https', netloc='www.searchenginejournal.com', path='/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/', params='', query='', fragment='')

Here we get a named tuple, ParseResult, with a breakdown of the URL components.

We are interested in the path.

We are going to use a simple regex to split it by the / and - characters.

slug = re.split("[/-]", u.path)

print(slug)
#output

['', 'google', 'be', 'careful', 'relying', 'on', '3rd', 'parties', 'to', 'render', 'website', 'content', '376547', '']

Next, we can convert it back to a string.

headline = " ".join(slug)

print(headline)

#output

' google be careful relying on 3rd parties to render website content 376547 '

The slugs contain a page identifier that is useless for us. We will remove it with a regex.

headline = re.sub(r"\d{6}", "", headline)

print(headline)
#output

' google be careful relying on 3rd parties to render website content '


#Strip whitespace at the borders

headline = headline.strip()
print(headline)
#output

'google be careful relying on 3rd parties to render website content'

Now that we have tested this, we can turn the code into a function and create a new column in our data frame.

def get_headline(url):
  u = urlparse(url)

  if len(u.path) > 1:
    slug = re.split("[/-]", u.path)
    new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()

    #skip author and category pages
    if not re.match("author|category", new_headline):
      return new_headline

  return ""

Let’s create a new column named headline.

new_df["headline"] = new_df["url"].apply(lambda x: get_headline(x))

Extracting Named Entities

Let’s explore and visualize the entities in our headlines corpus.

First, we combine them into a single text document.

import spacy
from spacy import displacy

text = "\n".join([x for x in new_df["headline"].tolist() if len(x) > 0])

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

displacy.render(doc, style="ent", jupyter=True)


We can see some entities labeled correctly and some labeled incorrectly, like Hulu tagged as a person.

There are also several missed entities, like Facebook and Google Display Network.

spaCy’s out-of-the-box NER is not perfect and generally needs training with custom data to improve detection, but it is good enough to illustrate the concepts in this tutorial.
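
This quick tally is not part of the original walkthrough, but counting the detected entities makes the hits and misses easier to spot:

from collections import Counter

# Count how many entities spaCy found per label (ORG, PERSON, PRODUCT, ...)
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts.most_common())

# The most frequent (text, label) pairs help surface mislabels like Hulu as PERSON
print(Counter((ent.text, ent.label_) for ent in doc.ents).most_common(20))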

Building a Knowledge Graph

Now, we get to the exciting part.

Let’s start by evaluating the grammatical relationships between the words in each sentence.

We do this by printing the syntactic dependency of the entities.

for tok in doc[:100]:
  print(tok.text, "...", tok.dep_)

We are looking for subjects and objects connected by a relationship.

We will use spaCy’s rule-based parser to extract subjects and objects from the headlines.

The rule can be something like this:

Extract the subject/object along with its modifiers and compound words, and also extract the punctuation marks between them.

Let’s first import the libraries that we will need.

import pandas as pd
from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm

To build a knowledge graph, the most important things are the nodes and the edges between them.

The main idea is to go through each sentence and build two lists: one with the entity pairs and another with the corresponding relationships.

We are going to borrow a couple of functions created by data scientist Prateek Joshi (both are sketched below):

  • The first one, get_entities, extracts the main entities and associated attributes.
  • The second one, get_relation, extracts the corresponding relationship between the entities.
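
The originals live in Prateek Joshi’s knowledge graph tutorial; the block below is a cleaned-up sketch of what they do, not his exact code. It assumes the nlp pipeline and the Matcher import from the cells above, and uses the spaCy v3 matcher.add signature (spaCy v2 uses matcher.add("matching_1", None, pattern)).

def get_entities(sent):
  ent1 = ""            # subject candidate
  ent2 = ""            # object candidate
  prv_tok_dep = ""     # dependency tag of the previous token
  prv_tok_text = ""    # previous token text
  prefix = ""          # accumulated compound words
  modifier = ""        # accumulated modifiers

  for tok in nlp(sent):
    # skip punctuation
    if tok.dep_ == "punct":
      continue
    # build up compound words (e.g., "knowledge graph")
    if tok.dep_ == "compound":
      prefix = tok.text if prv_tok_dep != "compound" else prv_tok_text + " " + tok.text
    # build up modifiers (e.g., "new explore tab")
    if tok.dep_.endswith("mod"):
      modifier = tok.text if prv_tok_dep != "compound" else prv_tok_text + " " + tok.text
    # a subject token closes the first entity
    if "subj" in tok.dep_:
      ent1 = (modifier + " " + prefix + " " + tok.text).strip()
      prefix, modifier = "", ""
    # an object token closes the second entity
    if "obj" in tok.dep_:
      ent2 = (modifier + " " + prefix + " " + tok.text).strip()
    prv_tok_dep, prv_tok_text = tok.dep_, tok.text

  return [ent1, ent2]


def get_relation(sent):
  doc = nlp(sent)
  matcher = Matcher(nlp.vocab)
  # the ROOT verb, optionally followed by a preposition, agent or adjective
  pattern = [{'DEP': 'ROOT'},
             {'DEP': 'prep', 'OP': '?'},
             {'DEP': 'agent', 'OP': '?'},
             {'POS': 'ADJ', 'OP': '?'}]
  matcher.add("matching_1", [pattern])  # spaCy v2: matcher.add("matching_1", None, pattern)
  matches = matcher(doc)
  span = doc[matches[-1][1]:matches[-1][2]]
  return span.text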

Let’s test get_entities on 100 headlines and see what the output looks like. I added len(x) > 0 to skip empty lines.

for t in [x for x in new_df["headline"].tolist() if len(x) > 0][:100]:
  print(get_entities(t))

Many extractions are missing elements or are not great, but because we have so many headlines, we should be able to extract useful facts anyway.

Now, let’s build the graph.

entity_pairs = []

for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0]):
  entity_pairs.append(get_entities(i))

Here are some example pairs.

entity_pairs[10:20]
#output

[['chrome', ''], ['google assistant', '500 million 500 users'], ['', ''], ['seo metrics', 'how them'], ['google optimization', ''], ['twitter', 'new explore tab'], ['b2b', 'greg finn podcast'], ['instagram user growth', 'lower levels'], ['', ''], ['', 'advertiser']]

Next, let’s build the corresponding relationships. Our hypothesis is that the predicate is actually the main verb in a sentence.

relations = [get_relation(i) for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0])]
print(relations[10:20])

#output
['blocker', 'has', 'conversions', 'reports', 'ppc', 'rolls', 'paid', 'drops to lower', 'marketers', 'facebook']

Next, let’s rank the relationships.

pd.Series(relations).value_counts()[4:50]

Finally, let’s build the knowledge graph.

# extract subject
source = [i[0] for i in entity_pairs]

# extract object
target = [i[1] for i in entity_pairs]

kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

# create a directed graph from the data frame
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12,12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', edge_cmap=plt.cm.Blues, pos=pos)
plt.show()

This plots a monster graph, which, while impressive, is not particularly useful.

Let’s try again, but take only one relationship at a time.

In order to do this, we will create a function that takes the relationship text as input.

def display_graph(relation):
  G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == relation], "source", "target",
                              edge_attr=True, create_using=nx.MultiDiGraph())

  plt.figure(figsize=(12,12))
  pos = nx.spring_layout(G, k=0.5)  # k regulates the distance between nodes
  nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos=pos)
  plt.show()

Now, when I run display_graph("launches"), I get the graph at the beginning of the article.

Here are a few more relationships that I plotted.

I created a Colab notebook with all the steps in this article and at the end, you will find a nice form with many more relationships to check out.

Just run all the code, click on the pulldown selector and click on the play button to see the graph.
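
The notebook’s form relies on Colab’s #@param annotations; a minimal version looks something like this (the exact list of relationships offered in the notebook may differ):

#@title Explore relationships { run: "auto" }
relation = "launches" #@param ["launches", "announces", "rolls", "acquires"]
display_graph(relation)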

Resources to Learn More & Community Projects

Here are some resources that I found useful while putting this tutorial together.

I asked my followers to share their Python projects, and I am excited to see how many creative ideas are coming to life from the community! 🐍🔥



Image Credits

All screenshots taken by author, August 2020
