Solutions — Exam III

Exam PA October 11, 2022 Project Statement

This model solution is provided so that candidates may better prepare for future sittings of Exam PA. It includes both a sample solution, in plain text, and commentary from those grading the exam, in italics. In many cases there is a range of fully satisfactory approaches. This solution presents one such approach, with commentary on some alternatives, but there are valid alternatives not discussed here.

General Information for Candidates

This examination has 12 tasks numbered 1 through 12 with a total of 100 points. The points for each task are indicated at the beginning of the task, and the points for subtasks are shown with each subtask.

Each task pertains to the business problem (and related data files) and data dictionary described below. Additional information on the business problem may be included in specific tasks. Where additional information is provided, including variations in the target variable, it applies only to that task and not to other tasks.

An .Rmd file accompanies this exam and provides useful R code for importing the data and, for some tasks, additional analysis and modeling. There are five datasets used in this exam. They are all subsets of a larger dataset that is not given to candidates.

The .Rmd file has a chunk for each task. Each chunk starts by reading in one or more data files into one or more dataframes that will be used in the task. This ensures a common starting point for candidates for each task and allows them to be answered in any order. When the datafile is read, the variables it contains are assigned a type (e.g., “numerical,” “factor”). The code that assigns variable types is easily changed (e.g., if month is read in as “numeric” but you want to treat it as a factor).

The responses to each specific subtask should be written after the subtask and the answer label, which is typically ANSWER, in this Word document. Each subtask will be graded individually, so be sure any work that addresses a given subtask is done in the space provided for that subtask. Some subtasks have multiple labels for answers where multiple items are asked for; each answer label should have an answer after it. Where code, tables, or graphs from your own work in R are required, they should be copied and pasted into this Word document.

Each task will be graded on the quality of your thought process (as documented in your submission), conclusions, and quality of the presentation. The answer should be confined to the question as set. No response to any task needs to be written as a formal report. Unless a subtask specifies otherwise, the audience for the responses is the examination grading team and technical language can be used. When “for a general audience” is specified, write for an audience not familiar with analytics acronyms (e.g., RMSE, GLM, etc.) or analytics concepts (e.g., log link, binarization).

Prior to uploading your Word file, it should be saved and renamed with your five-digit candidate number in the file name. If any part of your exam was answered in French, also include “French” in the file name. Please keep the exam date as part of the file name. It is not required to upload your .Rmd file or other files used in determining your responses, as needed items from work in R will be copied over to the Word file as specified in the subtasks. The Word file that contains your answers must be uploaded before the five-minute upload period expires.

Business Problem

Your boss recently started a consulting firm, PA Consultants, specializing in predictive analytics. You and your assistant are the only other employees. Your boss informs you that a local politician from Baton Rouge, Louisiana, USA has hired your firm. Baton Rouge, a city of about 230,000 residents, is the capital of the state of Louisiana, USA.

The client is about to launch a campaign with the mottos, “Clean up Baton Rouge” and “Treat all Neighborhoods Equally – including yours!” The client wants to improve garbage and waste collection. In particular, the client cares about shortening resolution times and ensuring equitable resolution times throughout the city. The client wants your ideas and inputs on the following:

  • Understanding time trends
  • Seeing whether different responding departments have different resolution times for similar tasks
  • Predicting resolution times for any type(s) of complaint

Your boss directs you to use a dataset [1] of public data that includes all the service requests from January 2016 to March 2022. There are over 300,000 service requests in this time period. Your assistant has prepared five subsets of the public data and has provided the following data dictionary that contains all the variables appearing in the subsets. Note that not every variable appears in every subset datafile.

[1] Source: City of Baton Rouge, Parish of East Baton Rouge.
Data Dictionary

Variable Name | Variable Values
Time.to.resolution | Days from service request to resolution
quarter | “Q1”, “Q2”, “Q3”, “Q4”; quarter of service request
month | 1 to 12, month of service request
year | 2016 to 2022, year of service request
year.mo | 201601 to 202203, 100*year + month
DEPARTMENT | “GROUNDS”, “BLIGHT”, “SANITATION”
LATITUDE | Latitude of service location, 30.2 to 30.6
LONGITUDE | Longitude of service location, -91.3 to -90.9
area | “N”, “W”, “D”, “LSU”; neighborhood of service location
Latitude_Binned | Latitude range for binned data (geo.grid.csv only)
Longitude_Binned | Longitude range for binned data (geo.grid.csv only)
Ave.time.to.resolution | Average Time.to.resolution for binned data (geo.grid.csv only)
call.count | Number of service requests for binned data (geo.grid.csv only)
TYPEid | An id representing a specific type of service request

Comments

Requests for service do not appear in the dataset until they are resolved.
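The date fields in the data dictionary are related by simple arithmetic, which the following sketch spells out. It is written in Python for illustration (the exam itself works in R), and it rests on two assumptions not stated in the dictionary: that quarter means calendar quarter, and that a helper such as months_since_start (a hypothetical name) is wanted for trend work.

```python
def year_mo(year: int, month: int) -> int:
    """Encode a year and month as 100*year + month, e.g. (2016, 1) -> 201601."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return 100 * year + month


def split_year_mo(code: int) -> tuple[int, int]:
    """Recover (year, month) from a year.mo code."""
    return divmod(code, 100)


def quarter(month: int) -> str:
    """Calendar-quarter label for a month (assumed convention)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"Q{(month - 1) // 3 + 1}"


def months_since_start(code: int, start_year: int = 2016) -> int:
    """Hypothetical helper: evenly spaced month index, 0 for January of start_year."""
    year, month = split_year_mo(code)
    return 12 * (year - start_year) + (month - 1)
```

One consequence of the encoding is that year.mo is not evenly spaced when treated as a number: December 2016 to January 2017 is a step of 89 (201612 to 201701), while every other month-to-month step is 1. An evenly spaced index like the one above, or treating the variable as a factor, avoids that jump.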
Task 1 (7 points)

Your boss asks you to review the quality of the data below. The data shows Time to Resolution for calls to pick up unwanted garbage carts. (This data is not found in any of the supplied files.)

(a) (2 points) Review the box plot below that your assistant made and describe an issue with the data.

Candidates received full credit for identifying outliers with very high time to resolution as an issue and describing how the outliers may arise, patterns in the outliers, or how the outliers could cause problems in addressing the business problem. A common mistake was misidentifying the outliers as the body of the distribution and stating that the actual box represents unreasonable zero values, when in fact this is an artifact of the y-axis scale caused by the high outliers.

ANSWER: The plot shows many outlier resolution times greater than one year. These resolution times are unreasonable for trash services. This suggests either that services were never performed or that the cases were not closed at the time service was completed.
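The issue described in part (a) can be illustrated with a small numeric sketch. The values below are invented for illustration and are not the exam data; the one-year cutoff and the conventional 1.5 × IQR fence are assumptions about how one might flag the outliers, and the code is Python rather than the exam's R.

```python
from statistics import median, quantiles


def summarize_resolution_times(days, cutoff_days=365.0):
    """Summarize a sample and count points beyond the box plot's upper fence."""
    q1, _, q3 = quantiles(days, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr  # conventional box plot whisker limit
    return {
        "median": median(days),
        "q3": q3,
        "upper_fence": upper_fence,
        "max": max(days),
        "n_above_fence": sum(d > upper_fence for d in days),
        "n_over_cutoff": sum(d > cutoff_days for d in days),
    }


# Invented sample: most requests close within days, a few stay open for years.
sample_days = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 400, 730, 1100]
summary = summarize_resolution_times(sample_days)
print(summary)
```

In a sample like this the box sits within the first few days while the y-axis must stretch to the largest value, which is why the box can look as if it lies at zero.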
(b) (1 point) List three options for handling the data issue.

Candidates received full credit for listing three distinct options that addressed the data issue. The most common mistakes were listing options to improve the graph rather than handle the data issue (e.g., using a log scale) and giving vague responses (e.g., listing “further investigation” as an option).

ANSWER:
1. Remove outliers with very high time to resolution from the dataset
2. Leave the outliers in the dataset without any modification
3. Censor the time to resolution variable

(c) (2 points) Select and explain which option from part (b) you would recommend.

Candidates performed well on this task overall. The most common recommendation was removing the outliers, but full credit was granted for any recommendation with a reasonable explanation.

ANSWER: I recommend removing the data with excessive resolution times. It seems likely that the requests were not closed when the service was performed because these response times stretch over multiple years.

(d) (2 points) Your assistant produces the following output from a GLM. (Note your assistant redefined year as years since 2016.)

This is a relatively straightforward calculation task, and candidates performed well overall. Varying amounts of partial credit were awarded to candidates with incorrect answers. Calculation errors (e.g., missing a coefficient in the formula) were awarded more partial credit than incorrect formulas (e.g., ignoring or misapplying the link function, incorrect residual calculation).
Calculate the residual for the predicted time to resolution using the values in the following table for a single observation. Show both the formula(s) used (with values substituted for variables) and the final value to two decimal places.

TYPEid | month | year | Area | Time to Resolution
173023 | 2 | 4 | N | 5

ANSWER:

$$\hat{y} = \exp(2.66173 - 0.637010 - 0.124969 \times 4 - 0.123720 - 0.056956) \approx 3.85 \text{ days}$$

$$r = 5 - 3.85 = 1.15 \text{ days}$$
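The calculation in part (d) follows the usual pattern for a GLM with a log link: sum the intercept and the coefficient terms, exponentiate to get the prediction, then subtract the prediction from the observed value. The Python sketch below (the exam itself uses R) takes the coefficient magnitudes from the answer; the negative signs are an assumption, chosen because they are the only combination that lands close to the reported prediction, and the function name is hypothetical. Only the term multiplied by 4 can be tied to a variable (year, as years since 2016); the other terms are treated as indicator terms with value 1.

```python
import math


def predict_log_link(intercept, terms):
    """Prediction under a log link: exp(intercept + sum of coefficient * value)."""
    linear_predictor = intercept + sum(coef * value for coef, value in terms)
    return math.exp(linear_predictor)


# Magnitudes come from the worked answer; the signs are assumed (see lead-in).
intercept = 2.66173
terms = [
    (-0.637010, 1),  # indicator term (assumed sign)
    (-0.124969, 4),  # year term, year = 4 years since 2016 (assumed sign)
    (-0.123720, 1),  # indicator term (assumed sign)
    (-0.056956, 1),  # indicator term (assumed sign)
]

observed = 5.0
y_hat = predict_log_link(intercept, terms)
residual = observed - y_hat  # residual on the response scale

print(f"prediction: {y_hat:.2f} days, residual: {residual:.2f} days")
```

With these rounded coefficients the prediction comes out within a couple of hundredths of the 3.85 days reported in the answer, and the residual correspondingly close to 1.15 days. Leaving out math.exp would return the linear predictor rather than a time in days, which is the link-function mistake the grader commentary mentions.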