Data Screening Assignment

50 points

Data screening is the starting point for working with data once they have been collected. Some data analysts say that screening and cleaning the dataset is often 70% of the work of conducting a statistical analysis. It is that important!

Before you begin this assignment, you should have already completed the Quiz: Pick Topic assignment. If you have not, stop now and go do it. You cannot complete this assignment without having done so. Go ahead…I’ll wait.

Now that you have completed the Quiz: Pick Topic assignment, you have selected a topic that is specific to this course. Yes, I will continue to emphasize that point because choosing a topic that is not aligned with the course needs is the number one issue I see in this course. As part of that assignment, you selected four variables: two categorical and two continuous. The next task is to screen the data for those variables.

Data screening is a review of the data prior to conducting statistical analyses. When conducting data screening, it is important to clearly understand the level of measurement for each of your variables, because the level of measurement determines how the data will be evaluated.

This assignment asks you to complete data screening for the four variables you selected in the Quiz: Pick Topic assignment. In this assignment, you will complete the following screening tasks:

  • Create frequency tables for each of the four variables
  • Create histograms for the two continuous variables
  • Create box plots for the two continuous variables using categories from one of the categorical variables
  • Create bar charts for the two categorical variables
  • Create scatter plots using the two continuous variables

I also suggest a bonus screening task, which is to create a crosstab for the two categorical variables.

Below I demonstrate how to accomplish each of these activities.

Tip: Pay careful attention to the variable level of measurement (i.e., categorical or continuous) for each screening tool. SPSS will let you run the procedure even with the incorrect variable type; however, the results will be meaningless. Including meaningless results in your data screening assignment demonstrates that one does not understand the task.

Frequency tables

Frequency tables may be created for either categorical or continuous variables, although they are mostly used for categorical variables. Frequency tables provide information about the number of times a particular category or value appears in the dataset and can be useful for determining if outliers are present. The video below demonstrates how to develop frequency tables for both categorical and continuous variables.
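
For readers who want to see the logic behind a frequency table, here is a minimal sketch in Python. The course itself uses SPSS, so this is only an illustration. The `gender` list is made-up data and `frequency_table` is a hypothetical helper, not part of any course material.

```python
from collections import Counter

# Hypothetical responses for one categorical variable (not course data).
# None stands in for a missing response.
gender = ["Male", "Female", "Female", "Male", "Male", "Female", "Male", None]


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter("Missing" if v is None else v for v in values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 1)) for cat, n in counts.most_common()]


table = frequency_table(gender)
for category, n, percent in table:
    print(f"{category:<10}{n:>5}{percent:>8.1f}")
```

Counting missing responses as their own row makes it easy to spot how much data is absent for a variable, which is one of the things screening is meant to reveal.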

 

Histograms

Histograms describe the distribution of continuous variables. As such, this tool is used only with continuous variables. The example in the video develops histograms for the DASS-Depression and IPIP-Neuroticism variables.
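
A histogram is a count of how many values fall into each of a series of equal-width bins. The sketch below shows that idea in Python as a text-only illustration. It assumes whole-number scores, and the `scores` list, the bin width of 5, and the `histogram_counts` helper are all invented for this example. SPSS chooses its own bins when it draws the chart.

```python
# Hypothetical whole-number scores on a continuous scale (not course data).
scores = [2, 3, 3, 5, 7, 8, 8, 9, 12, 14, 15, 21, 22, 30]
BIN_WIDTH = 5  # assumed bin width, chosen only for illustration


def histogram_counts(values, bin_width):
    """Count values per bin; each bin covers start through start + bin_width - 1."""
    starts = [(v // bin_width) * bin_width for v in values]
    # Create every bin between the lowest and highest so empty bins show as 0.
    counts = {s: 0 for s in range(min(starts), max(starts) + bin_width, bin_width)}
    for s in starts:
        counts[s] += 1
    return counts


bins = histogram_counts(scores, BIN_WIDTH)
for start, n in bins.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * n}")
```

In this made-up data most scores pile up in the lowest bins with a thin tail of high scores, which is the kind of shape to mention in a write-up when a variable is not normally distributed.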

 

Box plots

Box plots are useful for examining the distribution of a continuous variable within the groups of a categorical variable. To create a box plot, select one categorical variable (the video example uses gender). The box plots then show the distributions of DASS-Depression and IPIP-Neuroticism, each of which is a continuous variable, within each gender group.
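
Box plots flag outliers based on how far a value falls from the box. The sketch below applies the widely used 1.5 × IQR rule within each group, in Python. It is an illustration under stated assumptions: the `records` data and the `outliers_by_group` helper are hypothetical, and statistical packages compute quartiles in slightly different ways, so counts from this sketch may not match SPSS output exactly.

```python
from statistics import quantiles

# Hypothetical (group, score) pairs (not course data).
records = [
    ("Female", 10), ("Female", 12), ("Female", 13),
    ("Female", 14), ("Female", 15), ("Female", 40),
    ("Male", 9), ("Male", 11), ("Male", 12),
    ("Male", 13), ("Male", 14), ("Male", 16),
]


def outliers_by_group(pairs, k=1.5):
    """Flag scores beyond k * IQR from the quartiles, separately for each group."""
    groups = {}
    for group, score in pairs:
        groups.setdefault(group, []).append(score)

    flagged = {}
    for group, scores in groups.items():
        q1, _, q3 = quantiles(scores, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        flagged[group] = [s for s in scores if s < low or s > high]
    return flagged


flagged = outliers_by_group(records)
print(flagged)
```

Because the fences are computed per group, a score can be an outlier within one group even if it would look unremarkable in the pooled data, which is why the box plot is drawn by group.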

 

Bar charts

Bar charts are useful for comparing the frequencies of the groups within a categorical variable. The video demonstrates how to produce bar charts for the two categorical variables of the study, which are gender and race.

 

Scatter plots

Scatter plots describe the relationship between two continuous variables. They are useful for determining whether the variables have a linear relationship and for visually gauging the approximate strength and direction of that relationship. The video demonstrates how to create a scatter plot using the two continuous variables for the example study, which are DASS-Depression and IPIP-Neuroticism.
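
The strength and direction that a scatter plot shows visually can also be summarized with Pearson's correlation coefficient. This assignment asks only for the plot, so the Python sketch below is purely a companion illustration. The `x` and `y` values and the `pearson_r` helper are hypothetical.

```python
from math import sqrt

# Hypothetical paired scores for two continuous variables (not course data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of xs and ys scaled to the range -1 to 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

A positive value corresponds to points that rise from left to right on the scatter plot, and a value nearer to 1 or -1 corresponds to points that cluster more tightly around a straight line.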

 

Crosstabs

Cross tabulation tables, or crosstabs, are useful for understanding the makeup of a dataset. A crosstab compares two categorical variables and counts the frequency of each combination of their categories. Using the categorical variables of gender and race, the video demonstrates how these two variables intersect, showing how many White males, African American males, White females, African American females, and other combinations of gender and race make up the dataset.
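
The counting behind a crosstab can be sketched in a few lines of Python. This is an illustration only: the `participants` list is made-up data and `crosstab` is a hypothetical helper. In the course you would produce the table in SPSS.

```python
from collections import Counter

# Hypothetical (gender, race) pairs, one per participant (not course data).
participants = [
    ("Male", "White"), ("Male", "African American"), ("Female", "White"),
    ("Female", "White"), ("Female", "African American"), ("Male", "White"),
]


def crosstab(pairs):
    """Count each row/column combination; combinations never observed get 0."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: cells[(r, c)] for c in cols} for r in rows}


table = crosstab(participants)
for row_label, row in table.items():
    print(row_label, row)
```

Each cell is the number of participants who fall into both categories at once, so the cell counts always add up to the total number of participants.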

 

Writing the results

The primary medium of communication among scholars is writing. Therefore, presenting the data screening approaches used and the results of data screening is a critical skill.

The introductory sentence should state, in general terms, what was done and should name the variables. For example,

Data screening was accomplished for the variables gender, race, DASS-Depression, and IPIP-Neuroticism from the EDCO 745 course dataset.

The next sentences will describe the data screening tasks and any notable results from each.

Frequency tables were created for gender, race, DASS-Depression, and IPIP-Neuroticism (see Tables 1-4). Results of the frequency tables indicated slightly more male (n = 704) than female (n = 596) participants (see Table 1). Visual inspection of histograms indicated a normal distribution for IPIP-Neuroticism (see Figure 1); however, DASS-Depression was not normally distributed, with a large number of participants having low scores, indicating few symptoms of depression (see Figure 2). Box plots were developed by gender to show the distribution of both DASS-Depression and IPIP-Neuroticism (see Figures 3-4). There were six outliers for IPIP-Neuroticism and no outliers for DASS-Depression across the gender groups. Additional screening included a bar chart to allow visual comparison of the gender groups (see Figure 5) and a scatter plot to examine the relationship between DASS-Depression and IPIP-Neuroticism, which visually indicated a positive correlation between the two (see Figure 6).

Note that within the write-up, each line references the table or figure that supports the statement being made. This requires one to correctly label each table and figure prior to submitting the assignment. Also, please note that the tables in the sample write-up are formatted according to APA style; they are not copied directly from SPSS, whose output is not in APA format.

Submitting the assignment

When submitting this assignment, you must first describe the data screening assignment. The write-up should describe the variables and the activities used to screen the data, along with the results. All submissions must be a single Microsoft Word document. Do not submit the SPSS file.