What data source did we use?

The news articles about South Africa water-related issues are extracted from two datasets from the GDELT project. GDELT maintains a database of the world's news media in different languages.

The articles data are extracted from GDELT DOC 2.0 API. This dataset includes basic information about news articles such as its URL, author, title, and publication date.

The knowledge graph data are extracted from GDELT Global Knowledge Graph 2.0. The database is updated daily. The knowledge graph data connects every person, organization, location, count, theme, news source, and event across the planet into a single network. We are using the location and theme data from this dataset.

We used the newsfeed library to extract the Knowledge graph data and used the gdelt-doc-api library to extract news articles data with Python.

DOC 2.0 API data (Articles data)

Extraction

We extracted doc 2.0 API data which satisfies the following two criteria:

The source of the article is a South African news outlet.

The article contains at least one of the themes on our selected theme list listed below. We selected the following 15 themes related to water supply, sanitation and water security because there is a limit of one API request every 5 seconds.
WB 2007 WATER SAFETY PLANS,
TAX DISEASE WATERBORNE ILLNESS,
WB 991 DRAINAGE WATER QUALITY,
WB 1995 ENERGY FOR WATER AND WASTEWATER TREATMENT,
WB 2978 WATER POLLUTION LOAD,
WB 1462 WATER SANITATION AND HYGIENE,
WATER SECURITY,
WB 1199 WATER SUPPLY AND SANITATION,
WB 139 SANITATION AND WASTEWATER,
WB 138 WATER SUPPLY,
WB 2008 WATER TREATMENT,
WB 1798 WATER POLLUTION,
WB 1063 WATER ALLOCATION AND WATER SUPPLY,
WB 149 WASTEWATER TREATMENT AND DISPOSAL,
WB 2981 DRINKING WATER QUALITY STANDARDS

Result

The news articles data extracted from GDELT doc2.0 API for the period fromJan 1, 2017 to Apr 14, 2022 includes the following information

URL

Title

Publish date

Image url

Domain of the news organization

Knowledge Graph data

Extraction

We extracted knowledge graph data which satisfy the following two criteria:

The source of the news article is from a South African media outlet. GDELT maintains a dataset of media in each country.

Instead of focusing on dozens of water-related themes like in the article database, we extracted 191 themes of interest to us related to water, natural disaster, climate change, and other relevant topics. We chose a broad range of themes for archiving purposes. The extraction process is relatively fast and yields data, so we could easily filter the information we extracted.

Result

The data extracted from GDELT knowledge graph dataset includes the following information:

Date

Themes of the article

Locations mentioned in the article

Person names mentioned in the article

Organizations mentioned the in the article

Domain of the news organization

URL of the article

We didn’t use the person names and organization names in the current data repository. Because of the fact that organization names are usually made of multiple words, it would require us to use a different methodology to implement the search functionality on these names, which we haven’t worked out currently. WHY NOT?

Data Processing and Filtering

Filter by content

We found many articles coded with water-related themes in the GDELT database that were not focused on water sanitation, water supply or water security. The article might contain relevant terms, but was not about water issues. We needed to filter out these irrelevant articles. It’s hard to do that manually because of the volume of articles. As a result, we obtained further information for these articles and then filtered them with OpenAI.

Extracting article summaries and keywords

We used the Full-text downloader (built with newspaper3k and Wayback Machine) from the newsfeed library to extract more detailed information for each article.

These information includes the following elements:

Article Full text
Article summary
Keywords of each article
Authors of each article

Letting OpenAI decide whether this is an article of our interest

We categorized the articles from the two datasets using the criteria listed below.Then we gave the classification task to OpenAI:

Where the article summary includes the word “Water” in its summary
Where the article summary doesn’t include the word “Water” in its summary

OpenAI charges by word fragments that they call “tokens.” Using title for the first round of classification for articles without the word “water” in its summary helped save on the cost.

Filter by location

Although all of the articles were sourced from South Africa, the collection of articles included reports from all over the world. Since we are only interested in news articles about Africa, we needed to filter out only articles about Africa.

Because most articles have multiple related locations, we needed to decide the country where the article is based. We used the most frequently mentioned location in the Location column to determine this. We created a country label for each article based on this process. Where several locations are mentioned the same times, we put “Multiple” as the country label.

Select articles with an African country label or “Multiple” label, and filter out all other articles

This narrowed the collection of stories to about 5,000 water-related articles focused on African countries.

Adding OpenUp dataset

We added a dataset Open Up had collected of 800 water-related articles in Africa that included keywords and country. This led to a total of about 6,000 articles.

Creating keywords for summary index

Location and theme columns are broken down into single words
These single words are concatenated into one column called “Summary” to make up the search index for our search engine.

Creating Article tags based on the themes

We manually filtered out water-related themes of all the themes of the articles we selected
This allowed us to identify the most frequently occurring themes in the dataset. We further manually brought similar themes together.

We used this list of 10 water-related themes as tags for each article.

Water Sanitation
Water Security
Water Supply
Wastewater
Water Treatment
Natural Disaster
Urban Water
Waterborne Diseases
Rural Water
Water Pricing

WATER WORLD

Methodology

What data source did we use?

How did we extract the data from the data source?

DOC 2.0 API data (Articles data)

Extraction

Result

Knowledge Graph data

Extraction

Result

Merged Articles and Knowledge Graph Data

Data Processing and Filtering

Filter by content

Extracting article summaries and keywords

Letting OpenAI decide whether this is an article of our interest

Filter by location

Adding OpenUp dataset

Creating keywords for summary index

Creating Article tags based on the themes

WATER
WORLD

Methodology

What data source did we use?

How did we extract the data from the data source?

DOC 2.0 API data (Articles data)

Extraction

Result

Knowledge Graph data

Extraction

Result

Merged Articles and Knowledge Graph Data

Data Processing and Filtering

Filter by content

Extracting article summaries and keywords

Letting OpenAI decide whether this is an article of our interest

Filter by location

Adding OpenUp dataset

Creating keywords for summary index

Creating Article tags based on the themes

WATERWORLD

WATER
WORLD