On completion of this unit the student should be able to interpret teacher-provided solution requirements and designs, extract data from large repositories, manipulate and cleanse data, conduct statistical analysis and develop data visualisations to display findings.
Detailed example
Based on one of the learning activities (KK6)
Extracting data from an online repository using SQL
For this activity students will acquire data from a large repository. Explain to students the context of the World Bank World Development Indicators and how useful they are in understanding and analysing a range of economic measures. These assist experts to create policies to enhance world development.
Students can open a web browser and navigate to the World Bank’s
databank.
The extraction of data from large repositories often requires the requestor providing structured instructions to receive the data they require. This links into how a Structured Query Language (SQL) statement can be created. This terminology is helpful to assist students understand the data extraction processes.
Students seeking to find data in large repositories will have to identify their database (part of an SQL statement’s FROM condition), specify fields required from a large list (part of an SQL statement’s SELECT condition) and only extract data that meets a specific set of criteria (part of an SQL statement’s WHERE condition). A template statement would be: SELECT fields FROM database WHERE conditions.
In the web reference, students use the data generator tool to specify the variables they require (on the left-hand side of the screen). They follow the steps below:
- Use the World Development Indicators database (FROM) and keep this option selected in the Databases drop down.
- Look for specific indicators or series to view (SELECT). This is where options need to be selected, for example GDP growth, unemployment rate, population, etc.
- Enter a specific list of countries (WHERE) with which to scope the study. Countries and aggregate areas are available in this chart. Individual countries are generally best selected for comparison purposes.
- Enter a specific list of points in time (WHERE). Appropriate scope needs to be considered for the amount of data to be analysed and displayed.
Students can be informed that the extraction tool uses Boolean and logic to find only data that exists for fields that satisfy both the requirements of country and dates. For example, if GDP data from Afghanistan from 2014–2021 is selected, only the data that meets these requirements is extracted.
Demonstrate to students that there are different formats of files for export that are generated by the database such as Excel, CSV and Tabbed Text. The data generated also includes referencing information that can be very useful when considering how reliable the data is. Each data source contains a reference from where it was collected.
Give students examples to interpret with the data generator. Provide indicators, countries and points in time. Allow students to increase in complexity the requirements of their queries.
For example:
- selecting the correct data (SELECT statement) –
in this case Afghanistan but could include different or additional Country or Area of the world values
- identifying the correct data source (FROM statement) –
in this case World Development Indicators but there are 85 data sources available for use in the World Bank Database
- using criteria to filter data (WHERE statement) –
in this case 2014–2021 but could have additional years added where values exist
- using multiple criteria with Boolean operators (AND/NOT/OR) –
in this case not used but could allow a comparison to be made between values in the 1980s by creating a condition that reads: 1980 > year > 1989 OR 2014 > year > 2021.
Students should extract data into each of the three format options.
Data can be sorted by one or multiple fields in a database, in either ascending or descending order. When this occurs in a database, an ORDER BY statement would be used.
Students should be made aware that they can use this method of expressing data collected from a repository. Students should be aware of methods of documenting data requirements because they will need to interpret these as part of Unit 3 Outcome 1 and they need to create requirements as part of their SAT (Unit 3 Outcome 2).
The data stored at the World Bank lends itself to data analytics of a range of very interesting data points over a period of time. Students can create charts (scatter, combo) that compare the impact of variables such as literacy rates, mortality rates and unemployment rates when overlaid with GDP data.
When sets of values are extracted from multiple database tables this can be completed using an INNER JOIN statement. A linking (primary/foreign) key relationship is determined between the database tables.
Consider an example where a database table (tblUnemployment) has the unemployment rate, the year and the country, and another database table (tblGDP) has GDP growth, the year and the country. To extract the unemployment rate, GDP growth, the year and the country (with some filtering and ordering of values), this would be a statement that can be used:
SELECT U.country, U.stats_year, U.unemployment_rate, G.gdp_growth
FROM tblUnemployment U
INNER JOIN tblGDP G on U.country = G.country AND U.stats_year = G.stats_year
WHERE U.stats_year > 2020
ORDER BY U.country DESC;
To practise using SQL, consider the use of an online SQL editor such as
sqliteonline .
With two different values (in this case GDP and unemployment) that can be compared, it allows the opportunity to analyse correlations. This is best completed within a spreadsheet once the data has been extracted.