Unit 3 Data analytics
Area of Study 1: Data analytics
Outcome 1
Respond to teacher-provided solution requirements and designs to extract data from large repositories, manipulate and cleanse data and apply a range of functions to develop software solutions to present findings.
Examples of learning activities
- Investigate the techniques used by the Australian Bureau of Statistics to collect census data in 2016. In that year, there was a change of collection method from paper-based to online. Compare and contrast the techniques used.
-
Provide students with a range of data sets. Include data that is missing one characteristic of integrity: timeliness, authenticity, relevance, accuracy, reasonableness or correctness. For each example, discuss criteria that could be used to identify data that breaches these characteristics and methods that could be put in place to ensure the integrity of data.
- Identify and explore a large repository that contains each of the following data: census, Geographic Information System (GIS), sensor, social media and weather. Demonstrate a range of methods to extract data into different output formats (e.g. TXT, CSV, XLSX, and JSON). As students extract data from each large repository, they manually create a reference list in a document using the APA method.
- Using the following examples, demonstrate to students why specific data types are used to store data:
- telephone numbers are best stored as strings because characters such as parenthesis or spaces are usually required for storage and no additional processing is required
- sorting numeric values or numbers as text achieves a different outcome – 1, 3, 5, 15, 101 (numeric values) vs. 1, 101, 15, 3, 5 (numbers as text) – due to text sorting not taking into account place value.
- Demonstrate sample solution requirements to problems, needs or opportunities in a range of formats (list, case study, timeline for implementation, expected data input).
- Demonstrate the process, in a step-by-step example, of how to document solution requirements, constraints and scope for a small case study. Some steps that could be useful to follow are:
- asking questions of the users (what do you need?)
- observing and identifying what users are currently doing
- understanding mandatory changes that could be needed (legislation or crucial business need)
- knowing the systems and processes that will be impacted by any change, including manual.
Ensure that each requirement is able to be tested. Define the types of constraints, list examples of each type and explain how this would interfere with a project (schedule, cost, quality of solution). To determine scope, provide an example of narrowing the scope of a research project: e.g. wide (research all defensive teams to see if they are more successful in sports), gradual narrowing (research all basketball teams to see if a defensive playing style makes them more successful), further narrowing (research all NBA teams, then narrow research on all NBA teams in the last five years). - Create a folder of generically named files that contain meaningful content. Students use an appropriate naming convention to rename these files based on the file’s content.
- Use a case study to transform a flat file of data from a large repository into a third normal form relational database. Provide a sample of data from the final database plus a list of steps that were required to transform the data into this state.
Steps that could be useful for the students to follow:
- Interpret data requirements from case study.
- Normalise the database to 3NF.
- Using a RDBMS package, create database tables (in Microsoft Access using design view) including applying validation rules and input masks.
- Link tables using appropriate cardinality (in Microsoft Access this is completed in the Database Tools-Relationships view).
- Import data for each table (in Microsoft Access this is most easily achieved using the Import Wizard).
- Create a design for a database that includes relationships and keys (primary and foreign) for a range of scenarios, such as:
- a new and used cars sales business
- a database of music files for a DJ.
Provide students with sample data records to insert into a database, and an incomplete data dictionary. Students fill in the missing information with appropriate responses. This allows revision of the skills of naming conventions, data types and electronic validation. - Provide students with a short activity that enables them to demonstrate their abilities with spreadsheets. Discuss with students how using spreadsheet software can assist in fulfilling design principles for functionality, usability, accessibility and appearance.
- Provide students with a full RDBMS database with data imported, and a list of data requirements for extraction. Students manipulate the data by creating queries that meet the required data requirements. To allow extensive practice for students, these should include:
- sorting (order by)
- filtering (by criteria)
- summarising (group by…sum/average/max/min).
- Provide students with a spreadsheet with many rows and columns of data in order to practise manipulating and cleansing data. Students complete the following:
- Ensure each sheet maintains original data in one tab – copy this tab to make any changes.
- Practise applying a filter (using text and numeric criteria) and using advanced sorting to a spreadsheet with over 300 records.
- Explore different ways of analysing data using spreadsheet functions and formulas. This can include summaries (average, sum totals), measures of spread (quartiles, standard deviation, median).
- Change the way that data is viewed using the transpose function, hiding and removing rows or columns.
- Create a list that identifies the purpose of a range of data visualisations. A good resource to categorise visualisations is found on the home page of the Data Visualisation Catalogue. Discuss as a class where each of the data visualisations could be used in a real-world scenario.
- Discuss conventions that will be used for a printed or static (black and white or coloured) data visualisation of different types (e.g. maps, bar charts). The following site is a useful resource to help students understand design thinking in order to make their data visualisation clearer for intended readers: Data analysis and visualisation: Creating a visualisation
- Develop a testing table structure (test number, description of test, expected result, actual result) and have students design test cases that would help to ensure that:
- a database has integrity
- a spreadsheet has functions and formulas that work correctly
- a data visualisation is effective in presenting information.
- Students work in groups, each one researching one of the following businesses: Google, Microsoft, Apple, Netflix, Facebook, eBay, Amazon. Groups create a list of the most important data that would be collected by each business, and document the method and techniques they use to collect this data. In turn, groups describe how the data they collect allows the business to be profitable. After each group has shared their information with the class, students create a list summarising the main reasons why users supply, and organisations acquire, data online.
Detailed example
Identifying large data repositories
For this activity students will acquire data from a large repository. Explain to students the context of the World Bank World Development Indicators and how useful they are in understanding and analysing a range of economic measures. These assist experts to create policies to enhance world development.
Students can open a web browser and navigate to the World Bank’s databank.
The extraction of data from large repositories often requires the requestor providing structured instructions to receive the data they require. This links into how a simple Structured Query Language (SQL) statement can be created. Although knowledge of SQL is not examinable, brief knowledge of the terminology can be helpful to assist students understand the data extraction processes.
Students seeking to find data in large repositories will have to identify their database (part of a SQL statement’s FROM condition), specify fields required from a large list (part of a SQL statement’s SELECT condition) and only extract data that meets a specific set of criteria (part of a SQL statement’s WHERE condition). A template statement would be: SELECT fields FROM database WHERE conditions.
In the web reference, students use the data generator tool to specify the variables they require (on the left-hand side of the screen). They follow the steps below:
- Use the World Development Indicators database (FROM) and keep this option selected in the Databases drop down.
- Look for specific indicators or series to view (SELECT). This is where options need to be selected, for example GDP growth, unemployment rate, population, etc.
- Enter a specific list of countries (WHERE) with which to scope the study. Countries and aggregate areas are available in this chart. Individual countries are generally best selected for comparison purposes.
- Enter a specific list of points in time (WHERE). Appropriate scope needs to be considered for the amount of data to be analysed and displayed.
Students can be informed that the extraction tool uses Boolean and logic to find only data that exists for fields that satisfy both the requirements of country and dates. For example, if GDP data from Afghanistan from 2010–2018 is selected, only the data that meets all of these requirements is extracted.
Demonstrate to students that there are different formats of files for export that are generated by the database such as Excel, CSV and Tabbed Text. The data generated also includes referencing information that can be very useful when considering how reliable the data is. Each data source contains a reference from where it was collected.
Give students examples to interpret with the data generator. Provide indicators, countries and points in time. Students should extract data into each of the three format options.
Students should be made aware that they can use this method of expressing data collected from a repository. Students should be aware of methods of documenting data requirements because they will need to interpret these as part of Unit 3 Outcome 1 and they need to create requirements as part of their SAT (Unit 3 Outcome 2).
As an extension activity, the data stored at the World Bank lends itself to data analytics of a range of very interesting data points over a period of time. Students can create analytics charts (scatter, combo) comparing the impact of variables such as literacy rates, mortality rates and unemployment rates when overlaid with GDP data.