Data 100, Spring 2024 Discussion #13 Solutions

Note: Your TA will probably not cover all the problems. This is fine; the discussion worksheets are not designed to be finished within an hour. They are deliberately made slightly longer so they can serve as resources you can use to practice, reinforce, and build upon concepts discussed in lectures, labs, and homework.

PCA Basics

1. Consider the following dataset, where X is the corresponding $4 \times 3$ design matrix. The mean and variance of each feature are also provided.

    Observation    Feature 1    Feature 2    Feature 3
    1                 -3.59         7.39        -0.78
    2                 -8.37        -5.32         0.90
    3                  1.75        -0.61        -0.62
    4                 10.21        -1.46         0.50
    Mean               0            0            0
    Variance          47.56        21.35         0.51

Suppose we perform a singular value decomposition (SVD) on this data to obtain $X = USV^T$. (Note: $U$ and $V^T$ are not perfectly orthonormal due to rounding to 2 decimal places.)

$$
U = \begin{bmatrix} -0.25 & 0.81 & 0.20 \\ -0.61 & -0.56 & 0.24 \\ 0.13 & -0.06 & -0.85 \\ 0.74 & -0.18 & 0.41 \end{bmatrix},
\quad
S = \begin{bmatrix} 13.79 & 0 & 0 \\ 0 & 9.32 & 0 \\ 0 & 0 & 0.81 \end{bmatrix},
\quad
V^T = \begin{bmatrix} 1.00 & 0.02 & 0.00 \\ -0.02 & 0.99 & -0.13 \\ 0.00 & 0.13 & 0.99 \end{bmatrix}
$$

(a) Recall that $XV$ contains the principal components of dataset $X$, and that we can alternatively calculate it as $US$. Prove, using the definition from lecture, that $XV = US$.

Solution: Since $V$ is orthonormal, we know that $V^T V = I$. Starting with $X = USV^T$, we right-multiply both sides by $V$:

$$XV = USV^T V = USI = US$$

This completes the proof.

Staff Notes: This is a great time to stress the properties of $U$, $\Sigma$, and $V^T$. For example, $V^T$ being orthonormal means $V^{-1} = V^T$, what $S$ being a diagonal matrix means, etc.
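The identity in part (a) can also be checked numerically. The following is a minimal NumPy sketch (not from the original worksheet) that plugs in the rounded matrices above; because every entry carries 2-decimal rounding error, $XV$ and $US$ agree only approximately:

    import numpy as np

    # The design matrix and its (rounded) SVD factors from the problem statement.
    X = np.array([[-3.59,  7.39, -0.78],
                  [-8.37, -5.32,  0.90],
                  [ 1.75, -0.61, -0.62],
                  [10.21, -1.46,  0.50]])
    U = np.array([[-0.25,  0.81,  0.20],
                  [-0.61, -0.56,  0.24],
                  [ 0.13, -0.06, -0.85],
                  [ 0.74, -0.18,  0.41]])
    S = np.diag([13.79, 9.32, 0.81])
    Vt = np.array([[ 1.00,  0.02,  0.00],
                   [-0.02,  0.99, -0.13],
                   [ 0.00,  0.13,  0.99]])

    print(X @ Vt.T)                                # principal components via XV
    print(U @ S)                                   # the same components via US
    print(np.allclose(X @ Vt.T, U @ S, atol=0.1))  # True, up to rounding error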
(b) Compute the vector for the first principal component (round to 2 decimal places).

Solution: We compute the first principal component by multiplying $X$ by the first column of $V$ (the transpose of the first row of $V^T$) to get $\begin{bmatrix} -3.44 & -8.47 & 1.74 & 10.18 \end{bmatrix}^T$ (your values may differ slightly due to rounding). You can also compute the first PC by observing that $XV = US$; therefore, the first principal component is also the first column of $US$.

(c) What is the component score of the first principal component? In other words, how much variance does it capture of the original data $X$?

Solution: The variance captured by the $i$-th principal component of the original data $X$ is equal to

$$\frac{(i\text{-th singular value})^2}{\text{number of observations } n}$$

In this case, $n = 4$ and $\sigma_1 = 13.79$. Therefore, the component score can be computed as follows:

$$\frac{13.79^2}{4} = 47.54$$

(d) (Bonus) Given the results of (a), how can we interpret the rows of $V^T$? What do the values in these rows represent?

Solution: Each principal component of $X$ is a linear combination of $X$'s features. The rows of $V^T$ correspond to the weights of each feature in the linear combinations that make up their respective principal components.
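Parts (b) and (c) are straightforward to reproduce in code. Below is a short sketch, assuming NumPy; note that np.linalg.svd may flip the signs of whole columns of u (and the matching rows of vt), which flips the signs of a principal component but not the variance it captures.

    import numpy as np

    X = np.array([[-3.59,  7.39, -0.78],
                  [-8.37, -5.32,  0.90],
                  [ 1.75, -0.61, -0.62],
                  [10.21, -1.46,  0.50]])
    n = X.shape[0]  # number of observations, n = 4

    u, s, vt = np.linalg.svd(X, full_matrices=False)

    # (b) The first principal component is the first column of US,
    #     or equivalently X times the first column of V.
    first_pc = u[:, 0] * s[0]
    print(first_pc)             # approx. [-3.44, -8.47, 1.74, 10.18], up to sign

    # (c) Variance captured by the i-th PC is (i-th singular value)^2 / n.
    component_scores = s ** 2 / n
    print(component_scores[0])  # approx. 47.54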
Applications of PCA

2. Lillian wants to apply PCA to food_PCA, a dataset of food nutrition information, to understand the different food groups. She needs to preprocess her current dataset in order to use PCA.

(a) What are the appropriate preprocessing steps when performing PCA on a dataset?

A. Transform each row to have a magnitude of 1 (Normalization)
B. Transform each column to have a mean of 0 (Centering)
C. Transform each column to have a mean of 0 and a standard deviation of 1 (Standardization)
D. None of the above

Solution: We can use standardization (C) or centering (B) of the columns for PCA, since each column contains the values of a particular feature for many observations. Standardization ensures that the standard deviation of each collection of feature values is 1, so that the variability in each feature across the data points is on a uniform scale. Additionally, we cannot compute the covariance matrix correctly using SVD if the feature columns are not centered with mean 0. Choice (A) is incorrect because it doesn't make sense to preprocess by row in PCA, since PCA is all about finding combinations of features (columns), not combinations of rows.

(b) Assume you have correctly preprocessed your data using the correct response in part (a). Write a line of code that returns the first 3 principal components, assuming you have the correctly preprocessed food_PCA and the following variables returned by SVD.

    u, s, vt = np.linalg.svd(food_PCA, full_matrices=False)
    first_3_pcs = __________________________________
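The preview cuts off before the official answer to 2(b). One completion consistent with the $XV = US$ identity from Question 1, assuming the u, s, and vt returned above, would be:

    # Hypothetical completion: the principal components are the columns of US,
    # so slice out the first three. s is a 1-D array of singular values, and
    # the broadcast u[:, :3] * s[:3] equals u[:, :3] @ np.diag(s[:3]).
    first_3_pcs = u[:, :3] * s[:3]

An equivalent form is food_PCA @ vt.T[:, :3], since $XV = US$.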