Wednesday, May 10, 2017

Assignment 6: Regression Analysis

Introduction:

Figure 1
Regression, like correlation, is a method of showing a relationship, or lack of relationship, between two variables. Unlike correlation, however, this method is directional: one variable is treated as dependent on the other. One can make a prediction about the dependent variable from a hypothetical independent value, as long as that value falls between the minimum and maximum of the preexisting data the regression analysis was run on. Linear regression analysis finds a best-fit line for the data with the ordinary least squares (OLS) method, along with an equation for this line in the form y = A + Bx. A is the constant, or y-intercept. B is the slope of the line, or the regression coefficient; it shows how much the dependent variable changes with a one-unit increase in the independent variable, and whether that change is positive or negative. A and B are found using the equations shown in Figure 1.
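As a sketch of what the Figure 1 formulas compute, A, B, and the r² discussed next can be found directly in Python; the data here is made up for illustration:

```python
import numpy as np

# Hypothetical data: independent variable x and dependent variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS slope (B) and intercept (A), following the Figure 1 formulas
B = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
A = y.mean() - B * x.mean()
r2 = np.corrcoef(x, y)[0, 1] ** 2   # coefficient of determination

print(f"y = {A:.3f} + {B:.3f}x, r² = {r2:.3f}")
```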

Regression also results in a coefficient of determination. This is the r² value, the proportion of variation in the dependent variable explained by the model. This value ranges from zero to one; the closer it is to one, the stronger the relationship.

Deviation from a regression model is shown by residuals, the differences between observed values and the model's predicted values. In this assignment, residuals are expressed as standard deviations away from the predictive model's value.

Regression can also be performed with multiple independent variables to create one equation that describes a data set using all of the variables. This is called multiple regression.

This assignment deals with single regression in part one and multiple regression in part two. In part one, the relationship between the percentage of children receiving free lunches and crime rate is examined; in part two, the relationships between 911 calls and several other variables are examined.

Part 1:

Methods, Results, and Discussion:

The supplied Excel file containing crime statistics and the percentage of children receiving free lunch at schools was opened in SPSS, and a linear regression analysis was performed to examine a possible relationship. This is shown in Figure 2. Percent of children receiving free lunch is used as the dependent variable and the crime rate (per 100,000 people) as the independent variable, because the assignment prompt asks for verification of a news station's claim that as free lunches increase, so does crime. The results are shown in Figure 3.
Figure 2
Figure 3
The r² value is 0.173, signifying a weak relationship: crime rate explains only about 17% of the variation in free-lunch percentage. The relationship is nonetheless significant at the 0.005 significance level by a t-test, so the null hypothesis can be rejected in favor of the alternative hypothesis that a relationship exists. The news station is right that as free lunch increases, so does crime rate; however, a linear relationship does not establish causation.

From the statistics, a best-fit line can be found: y = 40.380 + 0.102x. This can be used to predict one variable based on the other. Evaluating the line at 32.5 gives a predicted value of about 43.7. Because of the very low r² value, however, this prediction carries little confidence; more variables would need to be considered to predict the result reliably.
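Evaluating the fitted line at a chosen value is a one-liner; this simply plugs the reported coefficients into y = A + Bx:

```python
# Evaluate the fitted line y = A + Bx at a chosen predictor value
A, B = 40.380, 0.102
x_new = 32.5
y_pred = A + B * x_new
print(f"Predicted value: {y_pred:.3f}")   # 43.695
```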

Part 2:

Introduction:

This part of the assignment compares 911 calls with other variables at the census tract level. The analysis is useful for deciding where a new hospital emergency room (ER) should be built. Three variables are first compared individually, with their residuals mapped; then multiple regression is performed, and a map is made of the residuals from the equation built on the three most important variables found.

Single Regression Methods:

After the supplied data was opened in IBM SPSS, three separate linear regressions were run using number of 911 calls as the dependent variable. The three independent variables were alcohol sales, number of people with no high school degree, and median income.
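Although the assignment used SPSS's dialogs, the same three single regressions can be sketched in Python with statsmodels. The file and column names here (tracts_911.csv, Calls, AlcoholX, LowEduc, MedIncome) are hypothetical stand-ins, not the dataset's actual names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("tracts_911.csv")  # assumed file of tract-level data
for predictor in ["AlcoholX", "LowEduc", "MedIncome"]:
    X = sm.add_constant(df[predictor])            # adds the intercept term A
    model = sm.OLS(df["Calls"], X).fit()          # ordinary least squares fit
    print(predictor, round(model.rsquared, 3), round(model.params[predictor], 4))
```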

A choropleth map was made in ArcMap showing the number of calls per tract; it is shown and described at the top of the results and discussion. Because number of people with no high school degree had the strongest relationship with number of 911 calls, a residual map was made for this variable.

Single Regression Results and Discussion:

Figure 4 shows the number of calls per tract. The map was classified by equal interval because this method best displays the range of the highest-call areas and distinguishes them from one another. The highest areas are in the north-central part of the study area, with another area of particularly high call rates in the southeast.

Figure 4


Alcohol Sales:
Figure 5
Figure 5 shows the relationship between alcohol sales and calls. The r² value is relatively small here, so alcohol sales does not explain the number of 911 calls well, though it is a factor. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For one unit of change in alcohol sales, calls rise by 3.069E-5.

Number of People with No High School Degree:

Figure 6
Figure 6 shows the relationship between the number of people with no high school degree and 911 calls in a census tract. The r² value is larger here, so this variable explains the number of 911 calls better. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For each additional person with no high school degree, calls rise by 0.166.

Figure 7
Because this was the strongest predictor of 911 calls, geoprocessing and mapping were used to show its residuals. Residuals remained substantial even for this variable, as the r² value was still far from one. The residuals show where the actual tract values deviate from the model created through the regression process. The geoprocessing used is shown in Figure 7, and the resulting map of standardized residuals (standard deviations of the residuals) is shown in Figure 8. The figure shows areas in the north-central and eastern parts of the study area with particularly high call counts that deviate positively from the predictive model's values. The presence of high residuals on this map shows that number of people with no high school degree alone does not predict the number of 911 calls well. Where to put a new ER can be determined from Figures 8 and 4: the areas with the most calls, and with more calls than the equation can explain, should be served by the new ER.
Figure 8
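The standardized residuals mapped in Figure 8 can be approximated as each residual divided by the residual standard deviation; a sketch continuing the hypothetical names above:

```python
import statsmodels.api as sm

# Standardized residuals for the LowEduc model (hypothetical names as before)
fit = sm.OLS(df["Calls"], sm.add_constant(df["LowEduc"])).fit()
resid = fit.resid                              # observed minus predicted calls
df["std_resid"] = resid / resid.std(ddof=2)    # in standard-deviation units, for mapping
```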


Median Income:
Figure 9
Figure 9 shows the relationship between median income and calls. The r² value is small here, so median income does not explain the number of 911 calls well, though it is a factor. The significance level of 0.000 shows that the relationship is significant, allowing the null hypothesis to be rejected in favor of the alternative, that there is a relationship here. For one unit of change in median income, calls fall by 0.404.

Multiple Regression Methods:

Multiple regression was run in SPSS with the same data by clicking Analyze, then Regression, then Linear, and making the number of 911 calls the dependent variable again. Then, instead of adding just one independent variable, all of the independent variables were added: Jobs, Renters, LowEduc (number of people with no HS degree), AlcoholX (alcohol sales), Unemployed, ForgnBorn (foreign-born population), Med Income, and CollGrads (number of college grads). The collinearity diagnostics option was then turned on. The results and discussion of the results can be seen below.
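A minimal sketch of the same all-variables fit, reusing the hypothetical file and column names from the earlier sketch:

```python
import pandas as pd
import statsmodels.api as sm

# All candidate predictors at once (hypothetical column names, as before)
df = pd.read_csv("tracts_911.csv")  # assumed file
predictors = ["Jobs", "Renters", "LowEduc", "AlcoholX",
              "Unemployed", "ForgnBorn", "MedIncome", "CollGrads"]
X = sm.add_constant(df[predictors])
full_model = sm.OLS(df["Calls"], X).fit()
print(full_model.summary())   # coefficients, r², p-values
```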

After the multiple regression was run with all of the variables selected, a stepwise approach was run. The results of this are shown below.

Multiple Regression Results and Discussion:


Figure 10
Figure 10 shows the model when multiple regression is run with all of the aforementioned variables entered. It also shows the r² value, which is moderately high, partly reflecting the sheer number of variables used (r² never decreases as variables are added).

Figure 11
Figure 11 shows the same multiple regression with all of the variables used. The Beta coefficients show that low education is the most influential variable in the equation and that the least influential is number of unemployed people. The collinearity diagnostics show no serious collinearity, because no condition index is larger than 30. If one above 30 had been found, variance proportions close to 1 would indicate where the collinearity was; that variable would then be eliminated and the multiple regression run again. Multicollinearity is undesirable because it can make genuinely significant variables appear insignificant (their coefficient estimates become unstable, with inflated standard errors) and makes the resulting equation less precise.
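SPSS derives its condition indices from a scaled cross-products matrix; below is a rough recreation under the same hypothetical setup (the exact SPSS scaling may differ slightly):

```python
import numpy as np

# Rough recreation of the condition-index diagnostic on the design matrix X
Xs = X / np.sqrt((X ** 2).sum(axis=0))      # scale columns to unit length
eig = np.linalg.eigvalsh(Xs.T @ Xs)          # eigenvalues of the scaled X'X
print(np.sort(np.sqrt(eig.max() / eig)))     # condition indices; > 30 flags trouble
```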
Figure 12

Figure 13

Figure 14
The above figures show the results of the stepwise multiple regression. The stepwise approach, as expected, chose three variables with high Beta values to build an equation: Renters, Low Education, and Jobs. With these variables, the r² value of 0.771 is higher than in every single-variable regression model (note that an unadjusted r² cannot exceed that of the model with all variables entered, so any advantage over the full model holds only in adjusted terms). The equation built from the three variables is: 911 calls = 0.024(Renters) + 0.103(LowEduc) + 0.004(Jobs). In this equation, Low Education is the most important variable with a Beta value of 0.464, then Jobs with 0.343, then Renters with 0.282. All variables are significant at high levels, the largest probability value being 0.002.
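SPSS's stepwise procedure both adds and removes variables based on entry and removal criteria; the sketch below is a simplified forward-only version under the same hypothetical setup, not the exact SPSS algorithm:

```python
# Minimal forward-stepwise sketch: add the predictor that most improves r²,
# keeping it only while its coefficient stays significant (alpha = 0.05).
# Assumes df, predictors, and sm from the earlier sketch.
selected, remaining = [], list(predictors)
while remaining:
    best = max(remaining, key=lambda p: sm.OLS(df["Calls"],
               sm.add_constant(df[selected + [p]])).fit().rsquared)
    fit = sm.OLS(df["Calls"], sm.add_constant(df[selected + [best]])).fit()
    if fit.pvalues[best] < 0.05:
        selected.append(best)
        remaining.remove(best)
    else:
        break
print(selected)
```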

Finally, with this equation, a residual map can be made similar to the one above, again using the Ordinary Least Squares (OLS) tool, this time with multiple independent (explanatory) variables. The resulting map is shown in Figure 15. The pattern of residuals has changed here, as the new equation describes different areas better. The area with the highest standard deviations above the equation's predicted values would be served best by a new ER.

Figure 15
Conclusion:

This assignment provided familiarity and practice with the concepts of regression, multiple regression, multicollinearity, and residuals, and with applying them in a spatial manner of analysis.





Monday, April 24, 2017

Assignment 5: Correlation and Autocorrelation

Introduction:

In this assignment, both correlations and spatial autocorrelations are found using statistical software, and practice with interpretation is gained. Microsoft Excel's scatter plot function is used to show variance in the variables being examined, IBM SPSS's bivariate correlation function is used to produce a correlation matrix (with Pearson correlation r values), and census data is downloaded, manipulated with Excel and ArcMap, and then used in GeoDa to find spatial autocorrelation (Moran's I) and create cluster maps.

Correlation is best described as how well a dataset matches a best-fit line; in other words, how consistently one variable goes up or down as the other variable increases. When two different variables are both available for a set of data points, the Pearson correlation coefficient, or sample correlation coefficient r (the equation for which is in Figure 1), describes the relationship between them. A coefficient close to one shows a very strong positive correlation, while a coefficient close to negative one shows a strong negative correlation. A correlation close to zero, from either side, shows a weak or non-existent relationship; this is called a null relationship.
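As a quick illustration of the Figure 1 formula, the coefficient can be computed directly (the numbers here are made up):

```python
import numpy as np

# Pearson r for two hypothetical variables
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])
r = np.corrcoef(x, y)[0, 1]   # equivalent to the Figure 1 formula
print(round(r, 3))             # close to 1: a strong positive correlation
```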

Figure 1


A correlation does not imply causation. Spurious relationships between variables do exist; in such cases another factor may be causing both variables to change, or the pattern of causation can be even more complicated.

Significance testing can also be done on a correlation coefficient found between two variables. IBM SPSS reports this automatically; the underlying route is hypothesis testing. In this case the null hypothesis is that ρ (the population correlation coefficient) = 0, and the alternative hypothesis is that ρ ≠ 0. The method is the same as other hypothesis testing: decide between a t-test and a z-test, choose the significance level, find the critical value or values (two in the case of a two-tailed test), run the test (shown in Figure 2), and compare the result to the critical values to determine whether the relationship is significant.
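The usual test statistic for a correlation is t = r√(n − 2)/√(1 − r²); a small sketch with hypothetical values:

```python
import numpy as np
from scipy import stats

# t-test for a correlation coefficient: t = r * sqrt(n - 2) / sqrt(1 - r^2)
r, n = 0.58, 40                         # hypothetical sample correlation and size
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tailed p-value
print(round(t, 3), round(p, 4))
```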



Spatial autocorrelation classifies areas, relative to their surroundings, as high-high, high-low, low-high, or low-low. High-high areas have a high value of the variable being examined and are surrounded by areas that also have high values. High-low areas have a high value but are surrounded by areas with low values; low-high is the opposite. Low-low areas have a low value and are surrounded by low values. When this classification scheme is mapped it is termed a cluster map because it shows the clusters of high and low values. Patterns normally emerge spatially because nearby things tend to be alike, so there are usually many high-high and low-low areas but few low-highs or high-lows. Moran's I is a spatial autocorrelation statistic that describes this clustering: values near one indicate strong clustering, values near zero indicate a random pattern, and values near negative one indicate dispersion, with neighboring values tending to be dissimilar.
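Below is a minimal sketch of the Moran's I computation, using a made-up row of six areas and a binary rook-contiguity weights matrix (GeoDa additionally row-standardizes weights and assesses significance by permutation):

```python
import numpy as np

def morans_i(values, W):
    """Moran's I for a value array and a binary spatial weights matrix W."""
    z = values - values.mean()
    n = len(values)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Hypothetical example: six areas in a row, neighbors share an edge
vals = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0   # rook-style contiguity along the row
print(round(morans_i(vals, W), 3))    # ~0.66: similar values cluster together
```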


Methods:

Pearson's Correlations Using IBM SPSS:

Using a dataset provided by the instructor, data was processed with IBM SPSS to find the bivariate Pearson correlation coefficient for each column in relation to every other column. Before this, the data was brought into Excel and scatter plots of each column were made. The data consisted of census tract population information in seven categories: manufacturing employees, retail employees, finance employees, white population, black population, hispanic population, and median household income.

In Excel, each column title was edited so that a descriptive name replaced the abbreviation; these names then appear as the titles of the scatter plots. Next, each entire column was selected, the Insert tab was opened, and the scatter plot button was clicked (Figure 1-1). Each scatter plot was then edited for consistent style using the buttons that appear when hovering over the plot. The scatter plots are shown in Figure 1-2; they show the variance in the distribution of each variable, with tract number on the X axis and number of people on the Y axis.
Figure 1-1
Figure 1-2
The Pearson correlation coefficients were then found for each combination of variables using IBM SPSS. The Excel file was opened using the parameters in Figure 1-3, then the bivariate correlation function was used with the menu options in Figure 1-4 and the parameters in Figure 1-5. The resulting correlation matrix is shown and examined in the results section below.

Figure 1-3
Figure 1-4
Figure 1-5
Spatial Autocorrelation with GeoDa:

For this part of the assignment, Texas Election Commission data on the 1980 and 2016 presidential elections was supplied by the instructor. Hispanic population data for 2015 (the DP05 dataset from the 2015 ACS 5-year estimates) was downloaded separately from the US Census Bureau and extracted to a personal folder. After extraction, the data was opened in Excel and the second row was deleted because it confuses ArcMap. All data besides the percentage hispanic was deleted as well to clean up the table, and the percent hispanic field was reformatted as numeric. Next, a shapefile covering all counties in Texas was downloaded from the census website. In ArcMap, this shapefile was joined with both the instructor-supplied data and the cleaned census data, and the result was exported as a shapefile for use in GeoDa.

After opening GeoDa, the File menu was clicked, then New Project From, then ESRI Shapefile, and the shapefile just created was opened. Next, a spatial weights file was created for the spatial autocorrelation: the Tools menu was opened and the spatial weights manager selected (Figure 1-6).
Figure 1-6
From here the Create button was clicked, rook contiguity was selected, and a new ID variable (a unique identifier field the software uses) was made by clicking Add ID Variable, then Add. This can be seen in Figure 1-7. The weights file then appeared in the weights manager window, and the window was closed.
Figure 1-7
Moran's I scatter plots and LISA cluster maps could now be created. Moran's I scatter plots were made by choosing the Moran scatter plot menu, selecting the univariate option, then selecting the weights file created earlier and the desired variable. This is shown in Figure 1-8. The resulting scatter plot was then copied and pasted into a folder.
Figure 1-8
To get LISA cluster maps for variables, the cluster map button was clicked, then the univariate local Moran's I option (Figure 1-9). The desired variable was selected, the cluster map choice was checked, and the cluster map was created.
Figure 1-9
Each cluster map was individually copied into a private folder.

Before finishing with the Texas county data, a correlation matrix was produced in IBM SPSS in the same way as for the Milwaukee tract data, and saved.

Results:

Pearson's Correlations Using IBM SPSS:

The results of the IBM SPSS bivariate correlation function are shown in the correlation matrix below in Figure 2-1. The matrix shows the correlation between each variable and every other variable. Because the matrix is symmetric, each correlation appears twice, mirrored across the diagonal of 1s where each variable is correlated with itself. SPSS also reports the significance of each relationship. Each correlation can be flagged as significant at two, one, or zero of the standard levels: if the significance figure is less than 0.05, it can be said with 95% certainty that there is a relationship between the two variables; if it is less than 0.01, it can be said with 99% certainty.

Figure 2-1
Looking at the relationships in the correlation matrix, statements can be made about individual relationships, including their strength, direction, and significance. The strongest positive relationship, the strongest negative relationship, and the most null relationship are discussed here.

The strongest positive relationship is between the white population and manufacturing employees, with a strength of r = 0.735. This is a positive relationship: when there are more white people in a tract, there are more manufacturing employees, and vice versa. The relationship is significant at the 0.01 level with a probability value of 0.000, which means there is 99% certainty that a relationship exists here.

The strongest negative relationship is between the black and white populations, with r = -0.582. This is a negative relationship: when there are more white people in a tract, there are fewer black people, and vice versa. This relationship is also significant at the 0.01 level with a probability value of 0.000, so there is 99% certainty of a correlation here.

The most null relationship is the one with the r value closest to 0. In this matrix it is the slight positive relationship between hispanic population and retail employees, with r = 0.058. In significance testing this relationship scored a probability value of 0.318, which fails at both the 0.01 and 0.05 significance levels. This means the positive relationship found in the sample is not significant, and the null hypothesis that the population correlation coefficient ρ = 0 is not rejected.

Spatial Autocorrelation with GeoDa:

The hypothetical situation for this part of the assignment is that the TEC wants to know about clustering patterns of voter turnout and voting, and how they have changed over time. Cluster maps and univariate Moran's I scatter plots of all of the variables provided were created to aid in this analysis, along with an SPSS correlation matrix. These are shown and discussed below.

Percent Democratic Vote in 1980 Election:

This variable has an I value in the middle of all of the I values found, though it is lower than the I value for the 2016 election. It seems that people are clustering in a more partisan manner, or that areas clustered one way have grown more so. There are clear areas of clustering in both the 1980 and 2016 maps in the north and south-central areas: low-low clustering in the north and high-high clustering in the south. By 2016, however, the low-low clustering in the north had shifted east by a few counties, some areas of high-high clustering in the east of Texas had disappeared, and some high-high clustering had appeared in the west.
Figure 2-2

Figure 2-3

Percent Democratic Vote in 2016 Election:

This map shows the changes described above in the 1980 Democratic vote section. The Moran's I here is significantly larger, which means more clustering.
Figure 2-4
Figure 2-5
Percent Hispanic by County:

This data shows an extremely high Moran's I, and thus heavy clustering. The northeast area of the state has counties with low hispanic population surrounded by others with low hispanic population, while the southwest of Texas, near the border, shows high percentages surrounded by more high percentages.
Figure 2-6

Figure 2-7

Percentage Voter Turnout in 1980:

Voter turnout is less clustered than hispanic population or democratic voting, but is still somewhat clustered. Low-low clustered voter turnout is seen in the south and east portions of the state, while high-high clustering is seen in the central and north portions, though to a lesser degree. Turnout was much more clustered in 1980 than in 2016 (Figure 2-10). The same general areas remained clustered in 2016, except that a low-low section in the east of the state disappeared and a few low-low areas appeared in new places in the central and west portions of the state.
Figure 2-8
Figure 2-9

Percentage Voter Turnout in 2016:

Voter turnout here was less clustered than in 1980; the specific changes are noted above.
Figure 2-10
Figure 2-11

SPSS Correlation Matrix:

The correlation matrix shows that all variables had significant correlations at the 0.01 (99%) significance level. The strongest was the positive relationship between percentage of hispanic population and percent democratic vote in the 2016 election, with an r value of 0.696. This was far higher than the insignificant positive correlation between hispanic population and democratic vote in 1980. The obvious assumption is that the Republican candidate in 2016 drove up the number of democratic votes, but this cannot be concluded; correlation does not equal causation. Other interesting relationships were found as well. The next highest r value in magnitude was a negative one, -0.623, between the hispanic population of a county and that county's 2016 voter turnout. This is interesting because, given what was at stake in the 2016 election, one might expect counties with more hispanics to have higher voter turnout percentages.

Another correlation seen is a negative one between voter turnout and democratic voting percentage: with higher percentage voter turnout, a smaller percentage of a county's votes are democratic.
Figure 2-12

Conclusion:

With correlation and hypothesis testing, relationships between variables tied to spatial data points can be found, and those relationships can be tested for significance through hypothesis testing or through the output of IBM's SPSS software. With the free spatial autocorrelation software GeoDa, clustering of individual variables can be found. These skills can be applied to many kinds of data, but are especially useful with election data and census data, two kinds of data naturally collected spatially, which make up a huge chunk of interesting data on people in the United States.


Wednesday, April 5, 2017

Assignment 4: Hypothesis Testing

Introduction to Hypothesis Testing

T-Test or Z-Test
Elements of the Test

Z- and T-tests are used to test whether a sample is statistically different from a whole population or a hypothesized value. A T-test is used when the sample size is less than 30, and a Z-test when the sample size is greater than 30. The equation used for Z- and T-testing is above. The process these tests are part of is called hypothesis testing.

Hypothesis testing has 5 steps listed below:

–1. State the null hypothesis, Ho
      There is no significant difference between the sample mean and the mean for the entire population
–2. State the alternative hypothesis, Ha
      There is a significant difference between the sample mean and the mean for the entire population
–3. Choose a statistical test
      Use Z if n (size of sample) is greater than 30, or T if less than 30.
–4. Choose α or the level of significance
      α = 0.05 or α = 0.01 (usually), corresponding to a Z score (critical value) at the 95% or 99% point for a one-tailed test, or 97.5% or 99.5% for a two-tailed test. This differs if a T-test is being performed: the critical value is instead found from the T-table using the degrees of freedom (sample size minus one) and α.

–5. Calculate test statistic
      This differs between a T-test and a Z-test, as discussed in step 4 above.

After the test statistic is calculated, the null hypothesis is either rejected in favor of the alternative hypothesis or fails to be rejected. The null hypothesis is assumed true until evidence to the contrary (a T- or Z-test pointing to Ha) appears.
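The five steps can be scripted end to end; this sketch uses made-up numbers and a two-tailed t-test (n < 30):

```python
import numpy as np
from scipy import stats

# Hypothetical numbers: sample mean, hypothesized mean, sample sd, sample size
sample_mean, pop_mean, s, n = 52.0, 50.0, 6.0, 25
test_stat = (sample_mean - pop_mean) / (s / np.sqrt(n))   # step 5
# Steps 3-4: n < 30, so use t with n - 1 degrees of freedom, alpha = 0.05
critical = stats.t.ppf(1 - 0.05 / 2, df=n - 1)
print(round(test_stat, 3), round(critical, 3))
print("reject H0" if abs(test_stat) > critical else "fail to reject H0")
```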

Part 1


1. Trial Calculations:

Bolded fields were provided. α was found by subtracting the confidence level from 100, dividing the result by 100, and dividing that by 2 if the interval type was two-tailed. Next, each entry was deemed a Z- or T-test depending on whether n was greater or less than 30. Finally, the critical value was found from the Z-table by locating the statistic corresponding to 1 − α and reading the Z value from the column and row headings, or from the T-table using n − 1 degrees of freedom and α. (A scripted check of a few of these values appears after the table.)


  | Interval Type | Confidence Level | n   | α    | z or t? | z or t value
A | 2 Tailed      | 90               | 45  | .05  | Z       | ±1.645
B | 2 Tailed      | 95               | 12  | .025 | T       | ±2.20099
C | 1 Tailed      | 95               | 36  | .05  | Z       | 1.64
D | 2 Tailed      | 99               | 180 | .005 | Z       | ±2.57
E | 1 Tailed      | 80               | 60  | .2   | Z       | 0.84
F | 1 Tailed      | 99               | 23  | .01  | T       | 2.50832
G | 2 Tailed      | 99               | 15  | .005 | T       | ±2.97684
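As referenced above, a few of the table's critical values can be re-derived with scipy (a quick scripted check, not part of the original workbook):

```python
from scipy import stats

# Re-deriving a few critical values from the table
print(stats.norm.ppf(1 - 0.05))        # row C: one-tailed 95% Z -> 1.645
print(stats.t.ppf(1 - 0.025, df=11))   # row B: two-tailed 95%, n = 12 -> 2.201
print(stats.t.ppf(1 - 0.01, df=22))    # row F: one-tailed 99%, n = 23 -> 2.508
```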

2.
A Department of Agriculture and Livestock Development organization in Kenya estimates that yields in a certain district should approach the following amounts in metric tons per hectare (averages based on data from the whole country): groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results:
Data for 2


a. Test the hypothesis for each of these products. Assume each is 2-tailed with a confidence level of 95%. *Use the appropriate test
b. Be sure to present the null and alternative hypotheses for each, as well as conclusions
c. What are the probability values for each crop?
d. What are the similarities and differences in the results?

For each crop I went through the five steps of hypothesis testing. I then analyzed the results in each step five, and finally compared all of them after performing each test.

Ground Nuts: 
1. Ho: There is no difference between the sample mean and the district estimate of ground nut yield. 
2. Ha: There is a difference between the sample mean and the district estimate of ground nut yield.
3. A T-test should be used because the sample size is less than 30.
4. The confidence level is 95%, giving α = .025 after considering the 2-tailed interval type. This corresponds to a critical value of ±2.07387.
5. Running the T-test gives a value of -0.7993, corresponding to a probability value of 0.2148. This is inside the interval from -2.07387 to 2.07387, so for ground nuts the null hypothesis fails to be rejected.

Cassava:
1. Ho: There is no difference between the sample mean and the district estimate of cassava yield. 
2. Ha: There is a difference between the sample mean and the district estimate of cassava yield.
3. A T-test should be used because the sample size is less than 30.
4. The confidence level is 95%, giving α = .025 after considering the 2-tailed interval type. This corresponds to a critical value of ±2.07387.
5. Running the T-test gives a value of -2.5578, corresponding to a probability value of 0.0054. This test statistic is outside the interval set by the critical values, so the null hypothesis is rejected. It can then be said with 95% certainty that there is a difference between the sample and the hypothesized yield.

Beans:
1. Ho: There is no difference between the sample mean and the district estimate of bean yield.
2. Ha: There is a difference between the sample mean and the district estimate of bean yield.
3. A T-test should be used because the sample size is less than 30.
4. The confidence level is 95%, giving α = 0.025 after considering the 2-tailed interval type. This corresponds to a critical value of ±2.07387.
5. Running the T-test gives a value of 1.9983, corresponding to a probability value of 0.9767. This falls inside the interval set by the critical values, so the null hypothesis fails to be rejected.

Conclusions: At the chosen significance level, ground nut and bean sample average yields are not significantly different from the estimates, while for cassava there is 95% certainty that the sample average yield is lower than the estimate.
3.
A researcher suspects that the level of a particular stream's pollutant is higher than the allowable limit of 4.2 mg/l. A sample of n = 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4. What are your conclusions? (one-tailed test, 95% significance level) Please follow the hypothesis testing steps. What is the corresponding probability value of your calculated answer?
1. Ho: There is no difference between the sample level and the allowable level of pollutant in the stream.
2. Ha: The sample level of pollution reveals higher levels of pollutant than allowable in the stream.
3. A T-test should be used because the sample size is less than 30.
4. The confidence level is 95%, giving α = 0.05 since this is a one-tailed test. The critical value found is 1.745884.
5. Running the T-test on the sample data gives a value of 2.061553, corresponding to a probability of 0.9803.

Conclusions: The T-test gives a value greater than the critical value, so the null hypothesis is rejected. It can be stated with 95% certainty that the sample reveals higher levels of pollutant than are allowed in the stream.
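Since this problem gives all of its summary statistics, the test statistic can be verified directly (a sketch using scipy):

```python
import numpy as np
from scipy import stats

# Stream pollutant test using the summary statistics from the problem
mu0, xbar, s, n = 4.2, 6.4, 4.4, 17
t = (xbar - mu0) / (s / np.sqrt(n))
crit = stats.t.ppf(0.95, df=n - 1)        # one-tailed, 95% confidence
print(round(t, 4), round(crit, 4))        # 2.0616 > 1.7459, so reject H0
```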


Part 2

Introduction: 

Part 2 brings a spatial aspect into hypothesis testing. Using a shapefile of US Census block groups in the City of Eau Claire, another of block groups in Eau Claire County, geoprocessing in ArcMap, hypothesis testing, and mapping, the question of whether the average value of homes differs significantly between the city and the county as a whole is answered.

Methods:

First, calculations were made for the two groups. Statistics were found by right-clicking the 2016 home value column in the attribute table of each shapefile in ArcMap and clicking Statistics. These were then processed the same way as the earlier hypothesis tests, treating the City of Eau Claire block groups as the sample and the whole-county values as the population.
Eau Claire City Home Value Statistics

Eau Claire County Home Value Statistics
Now, the five steps were followed to get the correct data.

1. Ho: There is no significant difference between the average home values of the City of Eau Claire block groups and those of the block groups for the whole county.
2. Ha: The average home value for the City of Eau Claire block groups is lower than the averages for the block groups of the entire county.
3. A Z-test should be performed because the n of the sample (the block groups of the City of Eau Claire) is greater than 30 (n = 53).
4. A 95% significance level one-tailed test corresponding to an α of 0.05 should be used. The critical value for this using the Z-table is -1.64.
5. Using the data found earlier in ArcMap to calculate the Z-test, the value obtained is -2.572.
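The same step-5 arithmetic, sketched with placeholder numbers (the actual means and standard deviation come from the ArcMap statistics windows above and are not reproduced here):

```python
import numpy as np

# Placeholder values only; the real figures come from the ArcMap statistics
city_mean, county_mean, county_sd, n = 148000.0, 160000.0, 36000.0, 53
z = (city_mean - county_mean) / (county_sd / np.sqrt(n))
print(round(z, 3))   # compare to the one-tailed critical value of -1.64
```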

Conclusions and Discussion:

Because the resulting Z-test value is smaller than the critical value, it can be concluded with 95% certainty that average home prices are lower in the City of Eau Claire than in the county as a whole.

The mean home values in both the city and the county are much less than the average prices of homes in the country as a whole, which ranged from $340,600 to $385,700 over 2016.

Mapping:

The data folder provided by the instructor was copied to a personal folder, then ArcMap was opened and the new document saved to the personal folder as well. Dragging the two block group shapefiles in from the catalog on the right side of the screen, setup of the data view began. A dissolve tool was first applied to the City of Eau Claire block group shapefile to remove internal boundaries, creating a feature class that could outline the block groups located inside the city. The fill was set to "no color" and the outline to black with a width of 2 in the Symbol Selector window, reached by clicking the symbol in the TOC (Table of Contents). The City of Eau Claire block groups shapefile was then turned off by unchecking its box in the TOC, and the symbology of the Eau Claire County block groups shapefile was set to quantities, graduated symbols, with the value set to "2016 Average Home Value" and the classification set to Natural Breaks with 5 classes. The class display values were then adjusted for better presentation, the projected coordinate system of the data frame was set to NAD 1983 StatePlane Wisconsin Central, other additions were made to the map, and it was exported. It can be seen below.

Eau Claire County and City Home 2016 Values by Block Group


Monday, February 20, 2017

Assignment 2


Introduction:

    This assignment practices the calculation of a variety of descriptive statistics, both on paper and in Microsoft Excel. It also practices finding descriptive spatial statistics in ArcMap.

Part 1:

    To begin, the descriptive statistics that will be used need to be defined. The first is range: the difference between the greatest and least of the observations. Next is the mean, otherwise known as the average: all of the observations added together, divided by the number of observations. After this is the median, simply the middle observation; if there is no single middle observation, it is the average of the two middle observations. Then there is the mode, the observation that occurs most often. Next is the kurtosis of the dataset, the "pointiness" of the dataset's curve, and the skewness, the degree to which the curve is pulled to one side or the other. A positive skewness means a tail leading right (pulled by outliers on the right side), and a negative skewness means a tail on the left. Finally, there is the standard deviation, a measure of how spread out the data is; its formulas are shown in Figure 1 (population standard deviation) and Figure 2 (sample standard deviation). A larger standard deviation means the data is more widely spread.
Figure 1
(Taken from class lecture)
Figure 2
(Taken from class lecture)
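These descriptive statistics can all be computed in a few lines; a sketch with a made-up set of race times:

```python
import numpy as np
from scipy import stats

# Descriptive statistics for a small hypothetical set of race times (minutes)
times = np.array([42.0, 44.5, 44.5, 46.0, 47.5, 51.0])
print("range:", times.max() - times.min())
print("mean:", times.mean(), "median:", np.median(times))
print("mode:", stats.mode(times, keepdims=False).mode)
print("skewness:", stats.skew(times), "kurtosis:", stats.kurtosis(times))
print("sample std dev:", times.std(ddof=1))   # ddof=1 for a sample (Figure 2)
```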
    In practice of these statistical methods, I have analyzed two bicycle race teams' scores. The scores for the two teams are shown in Figure 3. I first calculated the standard deviations for the two teams by hand; these are shown in Figure 4 and Figure 5. The overall team stats are in Figure 6.
Figure 3
Figure 4
Figure 5
Figure 6
In analysis of these statistics, I would absolutely choose to invest in team Tobler over team Astana. Despite Astana having the winning racer in this race, team Tobler's racers are consistently better: there is less variation in skill level on the team (smaller range and smaller standard deviation), and both the median and mean race times are faster than team Astana's. If the winning team gains $400,000 and 35% of this goes to the owner ($140,000), while the racer that wins the race gets $300,000 with only 25% going to the owner ($75,000), I would much rather invest in the team that will win than in the team with the racer that will win, and in this case that team is team Tobler, whose members are consistently better.


Part 2:

    The statistical methods used must again be defined. In this section, mean center and weighted mean center are used. The mean center of an area is found from the centers of all of the subareas that make it up; in this assignment it is calculated from the centers of all of the counties of Wisconsin. All x values for these centers are averaged, all y values are averaged, and the mean center is the (mean x, mean y) coordinate. For the weighted mean center, each subarea's coordinates are weighted by an individual value; in this assignment, those values are the populations of each county in 2000 and in 2015.
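A minimal sketch of the weighted mean center calculation, with made-up coordinates and weights:

```python
import numpy as np

# Weighted mean center: centroid coordinates weighted by population
x = np.array([10.0, 20.0, 30.0])        # county centroid x coordinates
y = np.array([5.0, 15.0, 25.0])         # county centroid y coordinates
pop = np.array([100.0, 300.0, 600.0])   # county populations (weights)
wx = np.sum(pop * x) / pop.sum()
wy = np.sum(pop * y) / pop.sum()
print((wx, wy))   # (25.0, 20.0); the unweighted mean center is (20.0, 15.0)
```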

    For the creation of the map below (Figure 7), I was supplied by my instructor with a data table already normalized for use in ArcMap. The table included a GEO_ID column for joining, a name column, and populations for each county in both 2000 and 2015. I imported this table into a new file geodatabase in a folder for this assignment, then right-clicked my shapefile of all Wisconsin counties, which I got from the US Census website. I selected Join, then joined the instructor-supplied table using the GEO_ID column to match records. I then ran the mean center tool three times, the second and third runs weighted by the two different population columns. The weighted mean center equation is shown below in Figure 8.
Figure 7
Figure 8
(Taken from class lecture)

    The weighted mean centers clearly show the weight of the two largest cities in Wisconsin: Milwaukee and Madison. Since 2000, it seems that the populations of Madison and other cities on the western side of the state have increased disproportionately to the increase Milwaukee has seen. Looking back at the data, this is very plausible: Dane County's population increased by 19.62% over the 15-year gap, from 426,526 to 510,198, while Milwaukee County's population increased only 1.68%, from 940,164 to 955,939.