Monday, April 24, 2017

Assignment 5: Correlation and Autocorrelation

Introduction:

In this assignment, correlations and spatial autocorrelations are computed with statistical software, and practice with interpretation is gained. Microsoft Excel's scatter plot function is used to show the variance in the variables being examined, IBM SPSS's bivariate correlation function is used to produce a correlation matrix (with Pearson correlation r values), and census data is downloaded, manipulated in Excel and ArcMap, and then used in GeoDa to measure spatial autocorrelation (Moran's I) and create cluster maps.

Correlation is best described as how well a dataset matches a best-fit line: in other words, how consistently one variable rises or falls as the other increases. When two variables are available for a set of data points, the Pearson correlation coefficient, or sample correlation coefficient r (the equation for which is in Figure 1), describes the relationship between them. A coefficient close to one indicates a very strong positive correlation, while a coefficient close to negative one indicates a strong negative correlation. A coefficient close to zero, from either the positive or negative side, indicates a weak or non-existent relationship. This is called a null relationship.

Figure 1
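As a small sketch of what the formula in Figure 1 computes (the assignment itself used SPSS for this), the sample correlation coefficient can be calculated directly, for example in Python:

```python
import math

def pearson_r(x, y):
    # r = covariance of x and y divided by the product of their spreads,
    # computed from deviations about each mean
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # exactly linear -> 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # exactly inverse -> -1.0
```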


A correlation does not imply causation. Spurious relationships between variables do exist; in these cases another factor may be causing both variables to change, or the pattern of causation can be more complicated still.

Significance testing can also be done on a correlation coefficient found between two variables. IBM SPSS reports this automatically, but the underlying route to significance is hypothesis testing. In this case the null hypothesis is that ρ (the population correlation coefficient) = 0 and the alternative hypothesis is that ρ ≠ 0. The method is the same as any hypothesis test: decide between a t-test and a z-test, choose the significance level, and find the critical value or values (two of them in a two-tailed test) before running the test (shown in Figure 2) and comparing the result to the critical values to determine whether the relationship is significant.
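For reference, the statistic such a test evaluates for H0: ρ = 0 can be sketched as below. This is only an illustration of the standard formula, not the SPSS internals:

```python
import math

def corr_t_stat(r, n):
    # t statistic for testing H0: rho = 0, with n - 2 degrees of freedom
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# e.g. r = 0.5 from a sample of 27 points gives t ~ 2.887 on 25 df
print(round(corr_t_stat(0.5, 27), 3))
```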



Spatial autocorrelation classifies areas, relative to their surroundings, as high high, high low, low high, or low low. High high areas have a high value of the variable being examined and are surrounded by areas that also have high values. High low areas have a high value but are surrounded by low values; low high is the opposite of this. Low low areas have a low value and are surrounded by low values. When this classification scheme is mapped it is termed a cluster map because it shows the clusters of high values and low values. Spatial patterns normally emerge because similar things tend to be near one another, so there are usually many high high and low low areas but few low highs or high lows. Moran's I is a spatial autocorrelation statistic that summarizes this clustering: an I closer to one means stronger clustering of similar values, an I near zero means a spatially random pattern, and an I closer to negative one means dissimilar values tend to sit next to each other (dispersion).
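A minimal sketch of the global Moran's I computation, assuming a binary contiguity weights matrix W (GeoDa builds the weights and handles significance internally; this is only the core formula):

```python
def morans_i(values, W):
    # values: list of n observations; W: n x n contiguity weights (e.g. rook)
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)   # total of all weights
    num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den

# four areas in a row, each a neighbor of the next: low values sit next to
# low, high next to high, so I comes out positive (clustering)
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i([0, 0, 1, 1], W))  # positive, about 0.33
```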


Methods:

Pearson's Correlations Using IBM SPSS:

Using a dataset provided by the instructor, the bivariate Pearson correlation coefficient was found in IBM SPSS for each column in relation to every other column. Before this, the data was brought into Excel and a scatter plot of each column was made. The data consisted of census tract population information in seven categories: manufacturing employees, retail employees, finance employees, white population, black population, Hispanic population, and median household income.

In Excel, each column title was edited one by one so that a descriptive name replaced the abbreviation; these names would then appear as the titles of the scatter plots. Next, each column was selected in full, the Insert tab was opened, and the scatter plot button was clicked (Figure 1-1). Each scatter plot was then edited for a consistent style using the buttons that appear when hovering over the plot. The scatter plots are shown in Figure 1-2; they show the variance in the distribution of each variable, with the tract number on the X axis and the number of people on the Y axis.
Figure 1-1
Figure 1-2
The Pearson correlation coefficients were then found for each combination of variables in IBM SPSS. The Excel file was opened using the parameters in Figure 1-3, then the bivariate correlation function was run with the menu options in Figure 1-4 and the parameters in Figure 1-5. The resulting correlation matrix is shown and examined in the results section below.

Figure 1-3
Figure 1-4
Figure 1-5
Spatial Autocorrelation with GeoDa:

For this part of the assignment, Texas Election Commission data on the 1980 and 2016 presidential elections was supplied by the instructor. Hispanic population data for 2015 was downloaded separately from the US Census Bureau; the specific dataset was DP05 (the 2015 ACS 5-year estimates). After extraction to a personal folder, the data was opened in Excel and the second row was deleted because it confuses ArcMap. All fields besides the Hispanic percentage were also deleted to clean up the data, and the percent Hispanic field was reformatted as numbers. A shapefile covering all counties in Texas was then downloaded from the census website. In ArcMap, this shapefile was joined with both the instructor-supplied data and the cleaned census data, and the result was exported as a shapefile for use in GeoDa.

After opening GeoDa, the File menu was clicked, then New Project From, then ESRI Shapefile, and the shapefile just created was opened. Next, a spatial weights file was created for the spatial autocorrelation: the Tools menu was opened and the Weights Manager selected (Figure 1-6).
Figure 1-6
From here the Create button was clicked, rook contiguity was selected, and a new ID variable (a unique identifier field the software uses) was made by clicking Add ID Variable, then Add. This can be seen in Figure 1-7. The weights file then appeared in the Weights Manager window, and the window was closed.
Figure 1-7
Moran's I and LISA cluster maps could now be created. Moran's I scatter plots were created by choosing the Moran scatterplot menu, selecting the univariate option, then selecting the weights file created earlier and the desired variable. This is shown in Figure 1-8. Each resulting scatter plot was then copied into a folder.
Figure 1-8
To get LISA cluster maps for variables, the Cluster Map button was clicked, then Univariate Local Moran's I. This is shown in Figure 1-9. The desired variable was then selected, the cluster map option was checked, and the cluster map was created.
Figure 1-9
Each cluster map was individually copied into a private folder.

Before finishing with the Texas county data, a correlation matrix was produced in IBM SPSS in the same way as for the Milwaukee tract data, and saved.

Results:

Pearson's Correlations Using IBM SPSS:

The results of the IBM SPSS bivariate correlation function are shown in the correlation matrix below in Figure 2-1. The matrix shows the correlation between each variable and every other variable. Because each variable appears twice, the correlations are mirrored across the diagonal of 1s where each variable is correlated with itself. The significance level reported for each pair indicates how significant the found relationship is, so each correlation can be said to be significant at one, both, or neither of the two standard levels. If the significance figure is less than 0.05, it can be said with 95% certainty that there is a relationship between the two variables; if it is less than 0.01, it can be said with 99% certainty.
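The two-level flagging described above can be mimicked with a small helper (the function name is hypothetical; SPSS marks these levels with asterisks in its output):

```python
def significance_flag(p):
    # "**": significant at the 0.01 level; "*": at the 0.05 level;
    # "": not significant at either level
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

print(significance_flag(0.000))  # '**' (99% certainty of a relationship)
print(significance_flag(0.318))  # ''   (fails both levels)
```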

Figure 2-1
Looking at the relationships in the correlation matrix, we can make statements about the individual relationships. These necessarily include the strength, direction, and significance of the relationship. The strongest relationships, both negative and positive, and the most null relationship are talked about here.

The strongest positive relationship is between the white population and manufacturing employees, with r = 0.735: when there are more white people in a tract there are more manufacturing employees, and vice versa. The relationship is significant at the 0.01 level with a reported probability of 0.000, which means there is 99% certainty that a relationship exists here.

The strongest negative relationship is the one between the black and white populations, with r = -0.582: when there are more white people in a tract there are fewer black people, and vice versa. This relationship is also significant at the 0.01 level with a probability of 0.000, again giving 99% certainty that a correlation exists here.

The most null relationship is the one with the r value closest to 0. In this matrix that is the positive relationship between Hispanic population and retail employees, with r = 0.058. In significance testing this relationship scored a probability of 0.318, which fails both the 0.01 and 0.05 significance levels. This means the positive relationship found in the sample is not significant, and the null hypothesis that the population correlation coefficient ρ = 0 is not rejected.

Spatial Autocorrelation with GeoDa:

The hypothetical situation for this part of the assignment is that the TEC wants to know about clustering patterns of voter turnout and voting patterns, and how they have changed over time. Cluster maps and univariate Moran's I scatter plots of all of the supplied variables were created to aid this analysis, along with an SPSS correlation matrix. These are shown and discussed below.

Percent Democratic Vote in 1980 Election:

This variable has an I value in the middle of the range of I values found, though lower than the I value for the 2016 election. It seems that people are clustering in a more partisan manner, or that areas already clustered one way have grown stronger. There are clear areas of clustering in both the 1980 and 2016 maps in the north and south central areas: low low clustering in the north and high high clustering in the south. By 2016, however, the low low cluster in the north had shifted east by a few counties, some high high clusters in east Texas had disappeared, and some high high clustering had appeared in the west.
Figure 2-2

Figure 2-3

Percent Democratic Vote in 2016 Election:

This map shows the changes described above in the 1980 Democratic vote section. The Moran's I here is significantly larger, which means more clustering.
Figure 2-4
Figure 2-5
Percent Hispanic by County:

This data shows an extremely high Moran's I, and thus heavy clustering. The northeast of the state has counties with low Hispanic population surrounded by other low values, while the southwest, near the border, shows high percentages surrounded by more high percentages.
Figure 2-6

Figure 2-7

Percentage Voter Turnout in 1980:

Voter turnout is less clustered than Hispanic population or Democratic voting, but is still somewhat clustered. Low low clustered turnout is seen in the south and east portions of the state, while high high clustering appears in the central and north portions, though to a lesser degree. Turnout was much more clustered in 1980 than in 2016 (Figure 2-10). The same areas remained clustered in 2016, apart from a low low section lost in the east of the state and a few new low low areas appearing in the central and west portions.
Figure 2-8
Figure 2-9

Percentage Voter Turnout in 2016:

Voter turnout here was less clustered than in 1980; the changes are noted in the 1980 section above.
Figure 2-10
Figure 2-11

SPSS Correlation Matrix:

The correlation matrix shows that all variables had significant correlations at the 0.01 (99%) level. The strongest correlation was the positive relationship between Hispanic population percentage and percent Democratic vote in the 2016 election, with r = 0.696. This was far higher than the insignificant positive correlation between Hispanic population and Democratic vote in 1980. The obvious assumption is that the Republican candidate in 2016 drove up the Democratic vote, but this cannot be concluded: correlation does not equal causation. Other interesting relationships were found as well. The next highest r value was negative, -0.623, correlating the Hispanic population of a county with that county's 2016 voter turnout. This is interesting because, given what was at stake in the 2016 election, one might expect counties with more Hispanics to have higher turnout percentages.

Another correlation seen is a negative one between voter turnout and Democratic voting percentage: with higher percentage turnout, Democratic votes make up a smaller share of a county's votes.
Figure 2-12

Conclusion:

With correlation and hypothesis testing, relationships between variables tied to spatial data points can be found, and those relationships can be tested for significance and confidence either by hand or through the output of IBM's SPSS software. With the free spatial analysis software GeoDa, developed by Luc Anselin and collaborators, the clustering of individual variables can be measured. These skills can be applied to many different kinds of data, but are especially useful with election data and census data, two kinds of data naturally collected spatially, which together make up a huge chunk of the interesting data on people in the United States.


Wednesday, April 5, 2017

Assignment 4: Hypothesis Testing

Introduction to Hypothesis Testing

T-Test or Z-Test
Elements of the Test

Z-tests and T-tests are used to test whether a sample is statistically different from a whole population or a hypothesized value. A T-test is used when the sample size is less than 30 and a Z-test when it is greater than 30. The equation used for Z- and T-testing is above. The process these tests are a part of is called hypothesis testing.

Hypothesis testing has 5 steps listed below:

–1. State the null hypothesis, Ho
      There is no significant difference between the sample mean and the mean for the entire population
–2. State the alternative hypothesis, Ha
      There is a significant difference between the sample mean and the mean for the entire population
–3. Choose a statistical test
      Use Z if n (size of sample) is greater than 30, or T if less than 30.
–4. Choose α or the level of significance
      α = 0.05 or α = 0.01 (usually), corresponding to a Z score (critical value) at the 95% or 99% point for a one-tailed test, or the 97.5% or 99.5% point for a two-tailed test. This is different if a T-test is being performed, as the critical value is instead found from the T-table using the degrees of freedom (sample size minus one) and α.

–5. Calculate test statistic
      The statistic differs between a T-test and a Z-test, as discussed in step 4 above.

After calculating the test statistic, the null hypothesis can either be rejected in favor of the alternative hypothesis or fail to be rejected. The null hypothesis is assumed true until evidence to the contrary (a T- or Z-test pointing to Ha) appears.

Part 1


1. Trial Calculations:

Bolded fields were provided. α was found by subtracting the confidence level from 100, dividing the result by 100, and dividing by 2 again if the interval type was 2 tailed. Next, each entry was assigned a Z- or T-test depending on whether n was greater or less than 30. Finally, a critical value was found: from the Z table, by locating 1-α in the body of the table and reading off the Z value from the column and row headings; or from the T table, by using n-1 for the degrees of freedom along with α.
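The Z side of this lookup can be reproduced with the Python standard library (T critical values still need a T-table or a stats package; the function name here is just for illustration):

```python
from statistics import NormalDist

def z_critical(confidence, tails):
    # per-tail alpha: (100 - confidence) / 100, halved for a two-tailed interval
    alpha = (100 - confidence) / 100
    if tails == 2:
        alpha /= 2
    # inverse normal CDF at 1 - alpha gives the critical Z
    return NormalDist().inv_cdf(1 - alpha)

print(round(z_critical(95, 1), 2))  # 1.64 (row C)
print(round(z_critical(80, 1), 2))  # 0.84 (row E)
```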


   Interval Type   Confidence Level   n     α      z or t?   z or t value
A  2 Tailed        90                 45    .05    Z         ±1.64
B  2 Tailed        95                 12    .025   T         ±2.20099
C  1 Tailed        95                 36    .05    Z         1.64
D  2 Tailed        99                 180   .005   Z         ±2.57
E  1 Tailed        80                 60    .2     Z         0.84
F  1 Tailed        99                 23    .01    T         2.50832
G  2 Tailed        99                 15    .005   T         ±2.97684
2.
A Department of Agriculture and Livestock Development organization in Kenya estimates that yields in a certain district should approach the following amounts in metric tons per hectare (averages based on data from the whole country): groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results:
Data for 2


a. Test the hypothesis for each of these products. Assume that each is 2 tailed with a Confidence Level of 95%. *Use the appropriate test
b. Be sure to present the null and alternative hypotheses for each as well as conclusions
c. What are the probability values for each crop?
d. What are the similarities and differences in the results?

For each crop I went through the five steps of hypothesis testing, analyzed the results at step five, and finally compared all of the results after performing each test.

Ground Nuts: 
1. Ho: There is no difference between the sample mean and the district estimate of ground nut yield. 
2. Ha: There is a difference between the sample mean and the district estimate of ground nut yield.
3. A T-test should be used because the sample size is less than 30.
4. The significance level is 95%, giving α=.025 after considering the 2 tailed interval type. This corresponds with a critical value of ±2.07387.
5. Running the T-test gives a value of -0.7993, corresponding to a probability value of 0.2148. This is inside the interval from -2.07387 to 2.07387, so for the ground nuts the null hypothesis fails to be rejected.

Cassava:
1. Ho: There is no difference between the sample mean and the district estimate of cassava yield. 
2. Ha: There is a difference between the sample mean and the district estimate of cassava yield.
3. A T-test should be used because the sample size is less than 30.
4. The significance level is 95%, giving α=.025 after considering the 2 tailed interval type. This corresponds with a critical value of ±2.07387.
5. Running the T-test gives a value of -2.5578, corresponding to a probability value of 0.0054. This test statistic is outside the interval set by the critical values, so the null hypothesis is rejected. It can then be said with 95% certainty that there is a difference between the sample and the hypothesized value.

Beans:
1. Ho: There is no difference between the sample mean and the district estimate of bean yield.
2. Ha: There is a difference between the sample mean and the district estimate of bean yield.
3. A T-test should be used because the sample size is less than 30.
4. The significance level is 95%, giving α=0.025 after considering the 2 tailed interval type. This corresponds with a critical value of ±2.07387.
5. Running the T-test gives a value of 1.9983, corresponding to a probability value of 0.9767. This value falls within the interval set by the critical values, which means the null hypothesis fails to be rejected.

Conclusions: At the chosen significance level, the ground nut and bean sample average yields are not significantly different from the estimates, while for cassava it can be said with 95% certainty that the sample average yield is lower than the estimate.
3.
A researcher suspects that the level of a particular stream's pollutant is higher than the allowable limit of 4.2 mg/l. A sample of n = 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4. What are your conclusions? (one tailed test, 95% Significance Level) Please follow the hypothesis testing steps. What is the corresponding probability value of your calculated answer?
1. Ho: There is no difference between the sample level and the allowable level of pollutant in the stream.
2. Ha: The sample level of pollution reveals higher levels of pollutant than allowable in the stream.
3. A T-test should be used because the sample size is less than 30.
4. The significance level is 95%, giving α=0.05 since this is a one-tailed test. The critical value found is 1.745884.
5. Running the T-test on the sample data gives a value of 2.061553, corresponding to a probability of 0.9803.

Conclusions: The T-test run gives a value that is greater than the critical value. This means the null hypothesis is rejected. It can be stated that it is 95% certain that the sample levels of pollution reveal higher levels of pollutant than are allowed in the stream.
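The arithmetic for this test can be checked directly, with all numbers taken from the problem statement above:

```python
import math

# stream pollutant test: n = 17, sample mean 6.4 mg/l, s = 4.4, limit 4.2 mg/l
n, xbar, mu0, s = 17, 6.4, 4.2, 4.4
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 6))  # 2.061553, which exceeds the critical value 1.745884
```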


Part 2

Introduction: 

Part 2 adds a spatial aspect to hypothesis testing. Using a shapefile of US Census block groups in the City of Eau Claire, another of block groups in Eau Claire County, processing in ArcMap, hypothesis testing, and mapping, the question of whether the average value of homes in the city is significantly different from the county as a whole is answered.

Methods:

First, calculations were made from the two groups. In an ArcMap document containing both shapefiles, statistics were found by right-clicking the 2016 home value column in each attribute table and clicking Statistics. These were then processed the same way as the earlier hypothesis tests, treating the City of Eau Claire data as a sample of the whole-county data.
Eau Claire City Home Value Statistics

Eau Claire County Home Value Statistics
Now, the five steps were followed to get the correct data.

1. Ho: There is no significant difference between the average home values of the City of Eau Claire block groups compared to the averages of the block groups for the whole county.
2. Ha: The average home value for the City of Eau Claire block groups is lower than the averages for the block groups of the entire county.
3. A Z-test should be performed because the n of the sample (the block groups of the City of Eau Claire) is greater than 30 (n = 53).
4. A 95% significance level one-tailed test corresponding to an α of 0.05 should be used. The critical value for this using the Z-table is -1.64.
5. Using the data found earlier in ArcMap to calculate the Z-test, the value obtained is -2.572.
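The decision in step 5 comes from comparing the statistic to the critical value. As a small sketch of the lower-tail decision rule (the helper name is hypothetical):

```python
def lower_tail_decision(z_stat, z_crit):
    # one-tailed (lower) test: reject H0 when the statistic falls
    # below the negative critical value
    return "reject H0" if z_stat < z_crit else "fail to reject H0"

print(lower_tail_decision(-2.572, -1.64))  # reject H0
```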

Conclusions and Discussion:

Because the resultant value of the Z-test is smaller than the critical value, it can be concluded with 95% certainty that average home prices are lower in the City of Eau Claire than in the county as a whole.

The mean home values in both the city and the county are far below the national average home prices, which ranged from $340,600 to $385,700 over the course of 2016.

Mapping:

The data folder provided by the instructor was copied to a personal folder, then ArcMap was opened and the new document saved in the personal folder as well. The two block group shapefiles were dragged in from the catalog on the right side of the screen to begin setting up the view of the data. A Dissolve tool was first applied to the City of Eau Claire block group shapefile to remove internal boundaries, creating a transparent but outlined feature class that could outline the block groups located inside the city. The transparency was set by selecting "No Color" as the fill color, and the outline by selecting black with an outline width of 2, in the Symbol Selector window reached by clicking the symbol in the TOC (Table of Contents).

The City of Eau Claire block groups shapefile was then turned off by unchecking its box in the TOC, and the symbology in the properties of the Eau Claire County block groups shapefile was set to Quantities, Graduated Symbols, with the value set to "2016 Average Home Value" and the classification set to Natural Breaks with 5 classes. The class display values were then adjusted for presentation, the projected coordinate system of the data frame was set to NAD 1983 StatePlane Wisconsin Central, other additions were made to the map, and it was exported. It can be seen below.

Eau Claire County and City Home 2016 Values by Block Group