Monday, April 24, 2017

Assignment 5: Correlation and Autocorrelation

Introduction:

In this assignment, correlations and spatial autocorrelations are found using statistical software, and practice is gained in interpreting them. Microsoft Excel's scatter plot function is used to show the variance in the variables being examined, IBM SPSS's bivariate correlation function is used to produce a correlation matrix (with Pearson correlation r values), and census data is downloaded, manipulated with Excel and ArcMap, and then used in GeoDa to measure spatial autocorrelation (Moran's I) and create a cluster map.

Correlation describes how closely a dataset follows a best-fit line; in other words, how consistently one variable rises or falls as the other increases. When two variables are available for a set of data points, the Pearson correlation coefficient, the sample correlation coefficient r (the equation for which is in Figure 1), describes the relationship between them. A coefficient close to one indicates a very strong positive correlation, while a coefficient close to negative one indicates a strong negative correlation. A coefficient close to zero, from either the negative or the positive side, indicates a weak or non-existent relationship. This is called a null relationship.

Figure 1
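As a sketch of what the r formula in Figure 1 computes, the coefficient can be written out directly in Python; the toy values below are hypothetical, not the assignment's data.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation coefficient r, computed from the
    definition: the covariance of x and y divided by the product of
    their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx * dx).sum() * (dy * dy).sum())

# Toy data with a strong positive relationship
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(pearson_r(x, y), 3))  # → 0.853
```

A value this close to one would be read, as above, as a strong positive correlation.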


Even if there is a correlation, this does not imply causation. Spurious relationships between variables do exist; in these cases another factor may be causing both variables to change, or the pattern of causation can be even more complicated.

Significance testing can also be done on a correlation coefficient found between two variables. IBM SPSS reports this automatically, but the underlying route to significance is hypothesis testing. Here the null hypothesis is that ρ (the population correlation coefficient) = 0 and the alternative hypothesis is that ρ ≠ 0. The methods are the same as in other hypothesis testing: a t-test or z-test is chosen, a significance level is set, the critical value or values (two in the case of a two-tailed test) are found, and then the test statistic (shown in Figure 2) is computed and compared against the critical values to decide whether the relationship is significant.
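This route can be sketched in Python. The sample sizes below are hypothetical stand-ins (the assignment's actual n is not stated), and scipy is assumed to be available for the t distribution.

```python
import math
from scipy import stats

def r_significance(r, n):
    """Two-tailed t test of H0: rho = 0 against H1: rho != 0.
    Test statistic: t = r * sqrt((n - 2) / (1 - r^2)), with n - 2
    degrees of freedom."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

# Hypothetical example: a strong r from 120 observations
t, p = r_significance(0.735, 120)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < 0.01: significant at the 99% level

# A weak r from 300 observations fails both levels
t2, p2 = r_significance(0.058, 300)
print(f"t = {t2:.2f}, p = {p2:.4f}")  # p > 0.05: not significant
```

The p value is compared against the chosen significance level (0.05 or 0.01) just as SPSS's reported significance figure is in the results section.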



Spatial autocorrelation analysis classifies areas, together with the areas surrounding them, as high-high, high-low, low-high, or low-low. High-high areas have a high value of the variable being examined and are surrounded by areas that also have high values. High-low areas have a high value but are surrounded by areas with low values; low-high is the opposite of this. Low-low areas have a low value and are surrounded by low values. When this classification scheme is mapped it is called a cluster map, because it shows the clusters of high and low values. Patterns normally emerge spatially because similar things tend to be near one another, so there are usually many high-high and low-low areas but few low-highs or high-lows. Moran's I is a spatial autocorrelation statistic that summarises this clustering: an I closer to one means stronger clustering, an I near zero means a random spatial pattern, and an I closer to negative one means dispersion, where high and low values alternate like a checkerboard.
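As a sketch of what GeoDa computes, global Moran's I can be written out directly. The grid values below are hypothetical, and the rook-contiguity weights mirror the contiguity choice used later in the assignment.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary rook-contiguity weight matrix for a rows x cols grid:
    cells sharing an edge (not just a corner) are neighbours."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[i, rr * cols + cc] = 1
    return W

def morans_i(x, W):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z is the variable centred on its mean and S0 = sum of weights."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s0 = W.sum()
    return len(x) / s0 * (z @ W @ z) / (z @ z)

# A 4x4 grid split into a high half and a low half: strong clustering
x = [9, 9, 1, 1,
     9, 9, 1, 1,
     9, 9, 1, 1,
     9, 9, 1, 1]
W = rook_weights(4, 4)
print(round(morans_i(x, W), 3))  # → 0.667
```

A checkerboard of the same two values over this grid gives I = -1, the fully dispersed extreme.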


Methods:

Pearson's Correlations Using IBM SPSS:

Using a dataset provided by the instructor, the IBM SPSS software was used to find the bivariate Pearson correlation coefficient for each column in relation to every other column. First, though, the data was brought into Excel and a scatter plot of each column was made. The data consisted of census tract population information in seven categories: manufacturing employees, retail employees, finance employees, white population, black population, Hispanic population, and median household income.

Opening the data in Excel, each column title was selected one by one and edited so that a descriptive name was used instead of an abbreviation; these names would then appear as the titles of the scatter plots. Next, each entire column was selected, the insert tab was opened, and the scatter plot button was clicked (Figure 1-1). Each scatter plot was then edited, using the buttons that appear when hovering over it, to be consistent in style. The scatter plots are shown in Figure 1-2. They show the variance in the distribution of each variable, with the tract number on the X axis and the number of people on the Y axis.
Figure 1-1
Figure 1-2
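The same kind of plot can be sketched outside Excel with matplotlib; the tract numbers and counts below are hypothetical stand-ins for one of the seven columns.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Hypothetical stand-in values: one employee count per census tract
tract = list(range(1, 11))
manufacturing = [120, 80, 200, 150, 90, 60, 130, 170, 110, 95]

fig, ax = plt.subplots()
ax.scatter(tract, manufacturing)
ax.set_title("Manufacturing Employees")  # descriptive name, as in the Excel step
ax.set_xlabel("Tract number")
ax.set_ylabel("Number of people")
fig.savefig("manufacturing_scatter.png")
```

Each column would get its own plot this way, matching the one-plot-per-variable layout of Figure 1-2.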
The Pearson correlation coefficients were then found for each combination of variables using the IBM SPSS software. The Excel file was opened using the parameters in Figure 1-3, then the bivariate correlation button was used with the menu options in Figure 1-4 and the parameters in Figure 1-5. The resulting correlation matrix is shown and examined in the results section below.

Figure 1-3
Figure 1-4
Figure 1-5
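The kind of matrix SPSS produces here can be sketched with pandas; the column names and values below are hypothetical stand-ins for the tract table, not the assignment's data.

```python
import pandas as pd

# Hypothetical stand-in for the tract table loaded into SPSS;
# the real assignment used seven census variables.
df = pd.DataFrame({
    "manufacturing": [120, 80, 200, 150, 90, 60],
    "retail":        [300, 250, 400, 320, 260, 210],
    "median_income": [42000, 38000, 55000, 47000, 39000, 33000],
})

# Pearson correlation matrix, analogous to SPSS's bivariate output:
# each variable against every other, 1s on the diagonal
print(df.corr(method="pearson").round(3))
```

As in the SPSS output, the matrix is symmetric, so each correlation appears twice, mirrored across the diagonal.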
Spatial Autocorrelation with GeoDa:

For this part of the assignment, Texas Election Commission data on the 1980 and 2016 presidential elections was supplied by the instructor. Hispanic population data for 2015 was downloaded separately from the US Census Bureau, specifically the DP05 dataset (the 2015 ACS 5-year estimates). After extraction to a personal folder, the data was opened in Excel and the second row was deleted, because it confuses ArcMap. All data other than the percentage Hispanic was deleted as well to clean up the table, and the percent Hispanic field was reformatted as numbers. A shapefile covering all counties in Texas was then downloaded from the census website. In ArcMap, this shapefile was joined with both the instructor-supplied data and the cleaned census data, and the result was exported as a shapefile for use in GeoDa.

After opening GeoDa, the file menu was clicked, and then new project from, then ESRI shapefile. The shapefile just created was opened. Next, a spatial weight was created for the spatial autocorrelation. The tool menu drawer was opened and the spatial weights manager was opened (Figure 1-6).
Figure 1-6
From here the create button was clicked, rook contiguity was selected, and then a new ID variable was made by clicking add ID variable, then selecting add. This can be seen in Figure 1-7. This is a unique identifier field that the software uses. The weight file then appeared in the weights manager window and the window was closed. 
Figure 1-7
Moran's I and LISA cluster maps could now be created. Moran's I scatter plots were created by choosing the Moran scatterplot menu, then the univariate option, then selecting the weight created earlier and the desired variable. This is shown in Figure 1-8. The resulting scatter plot was then copied and pasted into a folder.
Figure 1-8
To get LISA cluster maps for variables, the cluster map button was clicked, then the univariate local Moran's I button was clicked. This is shown in Figure 1-9. After clicking this, the variable desired was clicked, and then the cluster map choice was checked after which the cluster map was created.
Figure 1-9
Each cluster map was individually copied into a private folder.
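The quadrant classification behind a LISA cluster map can be sketched as follows. The four-area chain and its weights are hypothetical, and the significance filtering GeoDa applies (keeping only areas with a significant local Moran's I) is omitted.

```python
import numpy as np

def lisa_quadrants(x, W):
    """Classify each area into the four LISA quadrants by the sign of
    its own centred value and of its spatial lag (the weighted mean of
    its neighbours' centred values)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    lag = (W @ z) / W.sum(axis=1)  # row-standardised spatial lag
    labels = []
    for zi, li in zip(z, lag):
        if zi >= 0 and li >= 0:
            labels.append("high-high")
        elif zi < 0 and li < 0:
            labels.append("low-low")
        elif zi >= 0:
            labels.append("high-low")
        else:
            labels.append("low-high")
    return labels

# Four areas joined end to end in a line (chain adjacency)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(lisa_quadrants([10, 9, 2, 1], W))
# → ['high-high', 'high-high', 'low-low', 'low-low']
```

The high end of the chain clusters with itself and the low end with itself, which is exactly the pattern a cluster map colours as high-high and low-low regions.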

Before finishing with the Texas county data, a correlation matrix was produced in IBM SPSS in just the same way as for the Milwaukee tract data. This was saved.

Results:

Pearson's Correlations Using IBM SPSS:

The results of the IBM SPSS bivariate correlation function are shown in the correlation matrix below in Figure 2-1. The matrix shows the correlation between each variable and every other variable. Because each variable appears both as a row and as a column, every correlation appears twice, mirrored across the diagonal of 1s where each variable is correlated with itself. SPSS also reports a significance figure for each relationship, so each correlation can be flagged as significant at one level, both levels, or neither. If the significance figure is less than 0.05, it can be said with 95% confidence that there is a relationship between the two variables; if it is less than 0.01, it can be said with 99% confidence.

Figure 2-1
Looking at the relationships in the correlation matrix, we can make statements about the individual relationships, covering the strength, direction, and significance of each. The strongest relationships, both positive and negative, and the most null relationship are discussed here.

The strongest positive relationship is between the white population and manufacturing employees, with r = 0.735. This is a positive relationship: when there are more white people in a tract there are more manufacturing employees, and vice versa. The relationship is significant at the 0.01 level with a reported probability of 0.000, meaning there is 99% certainty that a relationship exists here.

The strongest negative relationship is the one between the black and white populations, with r = -0.582. This is a negative relationship: when there are more white people in a tract there are fewer black people, and vice versa. This relationship is also significant at the 0.01 level with a reported probability of 0.000, meaning there is likewise 99% certainty that a correlation exists here.

The most null relationship is the one with the r value closest to 0. In this matrix that is the weak positive relationship between Hispanic population and retail employees, with r = 0.058. In significance testing this relationship scored a probability of 0.318, which fails both the 0.01 and 0.05 significance levels. This means the positive relationship found in the sample is not significant, and the null hypothesis that the population correlation coefficient ρ = 0 is not rejected.

Spatial Autocorrelation with GeoDa:

The hypothetical situation for this part of the assignment is that the TEC wants to know about clustering patterns in voter turnout and voting, and how they have changed over time. Cluster maps and univariate Moran's I scatter plots of all the variables provided were created to aid in this analysis, along with an SPSS correlation matrix. These are shown and discussed below.

Percent Democratic Vote in 1980 Election:

This variable has an I value in the middle of all the I values found, but it is lower than the I value for the 2016 election. It seems that people are clustering in a more partisan manner, or that areas already clustered one way have grown stronger. Clear areas of clustering appear in both the 1980 and 2016 maps in the north and south-central parts of the state: low-low clustering in the north and high-high clustering in the south. By 2016, however, the low-low cluster in the north has shifted east by a few counties, some high-high clusters in the east of Texas have disappeared, and some high-high clustering has appeared in the west.
Figure 2-2

Figure 2-3

Percent Democratic Vote in 2016 Election:

This map shows the changes described above in the 1980 Democratic vote section. The Moran's I here is significantly larger, which means more clustering.
Figure 2-4
Figure 2-5
Percent Hispanic by County:

This data shows an extremely high Moran's I, and thus heavy clustering. The northeast of the state has low Hispanic population surrounded by other low Hispanic population, while the southwest, near the border, shows high percentages surrounded by more high percentages.
Figure 2-6

Figure 2-7

Percentage Voter Turnout in 1980:

Voter turnout is less clustered than Hispanic population or Democratic voting, but is still somewhat clustered. Low-low clustered voter turnout is seen in the south and east of the state, while high-high clustering is seen in the central and north portions, though to a lesser degree than the low-low clustering. Turnout was much more clustered in 1980 than in 2016 (Figure 2-10). The same areas are clustered in 2016, except that a low-low section in the east of the state has disappeared and a few low-low areas have appeared in new places in the central and west portions of the state.
Figure 2-8
Figure 2-9

Percentage Voter Turnout in 2016:

Voter turnout here was less clustered than in 1980. Some changes happened which are noted above.
Figure 2-10
Figure 2-11

SPSS Correlation Matrix:

The correlation matrix shows that most variable pairs had correlations significant at the 0.01 (99%) level. The strongest was the positive relationship between percentage Hispanic population and percent Democratic vote in the 2016 election, with r = 0.696. This was far higher than the insignificant positive correlation between Hispanic population and Democratic vote in 1980. The obvious assumption is that the Republican candidate in 2016 drove up the Democratic vote, but this cannot be concluded: correlation does not equal causation. Other interesting relationships were found as well. The next highest r value in magnitude was a negative one, -0.623, between the Hispanic population of a county and that county's 2016 voter turnout. This is interesting because, given what was at stake in the 2016 election, one might expect counties with larger Hispanic populations to have higher turnout percentages.

Another correlation seen is a negative one between voter turnout and Democratic voting percentage: as the turnout percentage rises, the Democratic share of a county's votes falls.
Figure 2-12

Conclusion:

With correlation and hypothesis testing, relationships between variables tied to spatial data points can be found, and those relationships can be tested for significance and confidence either by hand or through the output of IBM's SPSS software. With the free spatial autocorrelation software GeoDa, developed by Luc Anselin and colleagues at the University of Illinois, clustering of individual variables can be found. These skills can be applied to many kinds of data, but they are especially useful with election data and census data, two kinds of data naturally collected spatially, which together make up a huge share of the interesting data on people in the United States.

