Predict Movie Collections

13 minute read

Objective: Given a dataset of movies, train a model to predict the collection of the movies once released. Also, we would compare Linear, Ridge and Lasso Regressions to determine which one is best suited here.

Data used in the below analysis: link.

#importing libraries required
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#importing the dataset
movies_data = pd.read_csv('Data Files/Movie_collection_test.csv')
#Quick look at the data
movies_data.head(5)
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Genre Avg_age_actors MPAA_film_rating Num_multiplex 3D_available
0 11200 520.9220 91.2 0.307 33257.785 173.5 9.135 9.31 9.040 9.335 7.96 308973 184.24 220.896 Drama 30 PG 618 YES
1 14400 304.7240 91.2 0.307 35235.365 173.5 9.120 9.33 9.095 9.305 7.96 374897 146.88 201.152 Comedy 50 PG 703 YES
2 24200 211.9142 91.2 0.307 35574.220 173.5 9.170 9.32 9.115 9.120 7.96 359036 108.84 281.936 Thriller 42 PG 689 NO
3 16600 516.0340 91.2 0.307 29713.695 169.5 9.125 9.31 9.060 9.100 6.96 384237 NaN 301.328 Thriller 40 PG 677 YES
4 17000 850.5840 91.2 0.307 30724.705 158.9 9.050 9.22 9.185 9.330 7.96 312011 169.40 221.360 Comedy 56 PG 615 NO
movies_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Collection           506 non-null    int64  
 1   Marketin_expense     506 non-null    float64
 2   Production_expense   506 non-null    float64
 3   Multiplex_coverage   506 non-null    float64
 4   Budget               506 non-null    float64
 5   Movie_length         506 non-null    float64
 6   Lead_ Actor_Rating   506 non-null    float64
 7   Lead_Actress_rating  506 non-null    float64
 8   Director_rating      506 non-null    float64
 9   Producer_rating      506 non-null    float64
 10  Critic_rating        506 non-null    float64
 11  Trailer_views        506 non-null    int64  
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object
 15  Avg_age_actors       506 non-null    int64  
 16  MPAA_film_rating     506 non-null    object
 17  Num_multiplex        506 non-null    int64  
 18  3D_available         506 non-null    object
dtypes: float64(12), int64(4), object(3)
memory usage: 75.2+ KB

There are missing values in the Time_Taken field.

movies_data.describe()
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 494.000000 506.000000 506.000000 506.000000
mean 45057.707510 92.270471 77.273557 0.445305 34911.144022 142.074901 8.014002 8.185613 8.019664 8.190514 7.810870 449860.715415 157.391498 260.832095 39.181818 545.043478
std 18364.351764 172.030902 13.720706 0.115878 3903.038232 28.148861 1.054266 1.054290 1.059899 1.049601 0.659699 68917.763145 31.295161 104.779133 12.513697 106.332889
min 10000.000000 20.126400 55.920000 0.129000 19781.355000 76.400000 3.840000 4.035000 3.840000 4.030000 6.600000 212912.000000 0.000000 201.152000 3.000000 333.000000
25% 34050.000000 21.640900 65.380000 0.376000 32693.952500 118.525000 7.316250 7.503750 7.296250 7.507500 7.200000 409128.000000 132.300000 223.796000 28.000000 465.000000
50% 42400.000000 25.130200 74.380000 0.462000 34488.217500 151.000000 8.307500 8.495000 8.312500 8.465000 7.960000 462460.000000 160.000000 254.400000 39.000000 535.500000
75% 50000.000000 93.541650 91.200000 0.551000 36793.542500 167.575000 8.865000 9.030000 8.883750 9.030000 8.260000 500247.500000 181.890000 283.416000 50.000000 614.750000
max 100000.000000 1799.524000 110.480000 0.615000 48772.900000 173.500000 9.435000 9.540000 9.425000 9.635000 9.400000 567784.000000 217.520000 2022.400000 60.000000 868.000000

_Marketin_Experience and Bugdet need another look as their mean, median and max values are expanding over a huge range. _

EDA!

sns.jointplot(x='Marketin_expense',y='Collection',data=movies_data)

Outliers are present which need to be treated as a part of pre-processing.

sns.jointplot(x='Budget',y='Collection',data=movies_data)

Budget seems to be fine as we’ll use it as is.
Let’s check our categorical data.

sns.countplot(x='Genre',data=movies_data)

sns.countplot(x='3D_available',data=movies_data)

sns.countplot(x='MPAA_film_rating',data=movies_data)

Our categorical seems fine to use except MPAA_film_rating. As it has only one value it won’t affect our model in any way. We can drop it.

movies_data.drop('MPAA_film_rating',axis=1, inplace=True)

Treating the outliers

We would use capping to treat the higher values in Marketin_Expense.

#checking the min and max value
movies_data['Marketin_expense'].min()
20.1264
movies_data['Marketin_expense'].max()
1799.524
#Capping the numbers above 1.5 times the 99 percentile
ul = np.percentile(movies_data['Marketin_expense'],[99])[0]
movies_data[movies_data['Marketin_expense'] > 1.5*ul]
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Genre Avg_age_actors MPAA_film_rating Num_multiplex 3D_available
5 10000 1378.416 91.2 0.307 31569.065 173.5 9.235 9.405 9.280 9.23 6.96 342621 146.00 280.800 Thriller 38 PG 654 YES
18 17600 1490.682 91.2 0.321 33091.135 173.5 9.020 9.155 9.075 9.15 7.96 383325 169.52 241.408 Comedy 52 PG 680 NO
486 20800 1799.524 91.2 0.329 38707.240 165.4 9.170 9.430 9.155 9.41 6.96 417588 188.16 281.664 Comedy 21 PG 666 YES
movies_data.Marketin_expense[movies_data['Marketin_expense'] > 1.5*ul] = 1.5*ul
movies_data[movies_data['Marketin_expense'] > ul]
<ipython-input-23-2002ed2d8b50>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_data.Marketin_expense[movies_data['Marketin_expense'] > 1.5*ul] = 1.5*ul
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Genre Avg_age_actors MPAA_film_rating Num_multiplex 3D_available
4 17000 850.5840 91.2 0.307 30724.705 158.9 9.050 9.220 9.185 9.330 7.96 312011 169.40 221.360 Comedy 56 PG 615 NO
5 10000 1271.1099 91.2 0.307 31569.065 173.5 9.235 9.405 9.280 9.230 6.96 342621 146.00 280.800 Thriller 38 PG 654 YES
10 30000 1042.7160 91.2 0.403 31980.135 173.5 9.155 9.340 9.210 9.470 6.96 474055 192.00 222.400 Thriller 52 PG 617 NO
14 14000 934.9220 91.2 0.307 25103.045 173.5 9.130 9.250 9.050 9.255 7.96 212912 120.80 241.120 Thriller 40 PG 693 YES
18 17600 1271.1099 91.2 0.321 33091.135 173.5 9.020 9.155 9.075 9.150 7.96 383325 169.52 241.408 Comedy 52 PG 680 NO
486 20800 1271.1099 91.2 0.329 38707.240 165.4 9.170 9.430 9.155 9.410 6.96 417588 188.16 281.664 Comedy 21 PG 666 YES

Treating the missing data

movies_data.Time_taken.mean()
157.39149797570857
#missing data
movies_data[movies_data['Time_taken'].isnull()]
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Genre Avg_age_actors MPAA_film_rating Num_multiplex 3D_available
3 16600 516.0340 91.20 0.307 29713.695 169.5 9.125 9.310 9.060 9.100 6.96 384237 NaN 301.328 Thriller 40 PG 677 YES
16 15000 236.6840 91.20 0.321 37674.010 164.3 9.050 9.230 8.980 9.100 7.96 335532 NaN 201.200 Thriller 35 PG 647 YES
40 21000 461.0220 91.20 0.260 32318.990 165.9 8.985 9.170 9.020 9.095 7.96 360183 NaN 241.680 Comedy 38 PG 753 NO
96 39400 25.7920 74.38 0.415 29941.450 146.4 8.570 8.695 8.510 8.630 7.16 380129 NaN 243.152 Thriller 44 PG 611 NO
126 27200 45.0358 71.28 0.462 30941.350 171.6 8.035 8.205 7.955 8.210 7.80 371051 NaN 302.176 Action 44 PG 484 YES
164 46600 23.0890 65.26 0.547 34135.475 102.7 6.010 6.115 5.965 6.280 7.06 480067 NaN 283.728 Comedy 22 PG 438 NO
166 37400 22.9864 65.26 0.547 31891.255 139.7 6.335 6.420 6.235 6.560 7.06 465689 NaN 222.992 Thriller 30 PG 439 NO
210 40200 22.7920 72.12 0.480 34257.685 163.5 8.685 8.875 8.660 8.935 6.82 432081 NaN 203.216 Comedy 20 PG 458 YES
211 39000 22.6524 72.12 0.480 32502.305 170.2 8.905 9.025 8.935 8.925 6.82 430817 NaN 263.120 Comedy 57 PG 515 YES
321 50000 23.9604 76.18 0.511 34341.010 115.9 7.925 8.095 8.020 8.065 7.28 456943 NaN 244.000 Drama 30 PG 480 YES
366 67600 30.8022 62.94 0.353 40012.665 155.3 8.940 9.025 8.815 8.995 9.40 483080 NaN 225.408 Drama 21 PG 681 YES
465 45200 105.2262 91.20 0.230 33952.160 154.8 8.610 8.810 8.720 8.845 6.96 437945 NaN 283.616 Drama 26 PG 743 NO
#updating 12 missing values with the mean value
movies_data.Time_taken = movies_data.Time_taken.fillna(movies_data.Time_taken.mean())
movies_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Collection           506 non-null    int64  
 1   Marketin_expense     506 non-null    float64
 2   Production_expense   506 non-null    float64
 3   Multiplex_coverage   506 non-null    float64
 4   Budget               506 non-null    float64
 5   Movie_length         506 non-null    float64
 6   Lead_ Actor_Rating   506 non-null    float64
 7   Lead_Actress_rating  506 non-null    float64
 8   Director_rating      506 non-null    float64
 9   Producer_rating      506 non-null    float64
 10  Critic_rating        506 non-null    float64
 11  Trailer_views        506 non-null    int64  
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object
 15  Avg_age_actors       506 non-null    int64  
 16  MPAA_film_rating     506 non-null    object
 17  Num_multiplex        506 non-null    int64  
 18  3D_available         506 non-null    object
dtypes: float64(12), int64(4), object(3)
memory usage: 75.2+ KB

Variable Transformation

This is not a mandatory step but we do it in hopes to get better result!

sns.pairplot(data=movies_data)

Marketin_expense and Trailer_views seem to have a non-linear relationship with Collection, let’s explore them further!

sns.jointplot(x='Marketin_expense', y='Collection', data=movies_data)

It seems like a log relationship but its not very strong so we would ignore it.

sns.jointplot(x='Trailer_views', y='Collection', data=movies_data)

This is an exp relationship, let’s convert it into a linear one for the ease of our model.

movies_data.Trailer_views = np.exp(movies_data.Trailer_views/100000)
sns.jointplot(x='Trailer_views', y='Collection', data=movies_data)

Now we have a more linear relationship between Trailer_views and collection.

Converting Categorical Data into dummy variables

feat = ['Genre','3D_available']
movies_data = pd.get_dummies(data=movies_data,columns=feat,drop_first=True)
movies_data.head(5)
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Genre_Comedy Genre_Drama Genre_Thriller 3D_available_YES
0 11200 520.9220 91.2 0.307 33257.785 173.5 9.135 9.31 9.040 9.335 7.96 21.971145 184.240000 220.896 30 618 0 1 0 1
1 14400 304.7240 91.2 0.307 35235.365 173.5 9.120 9.33 9.095 9.305 7.96 42.477308 146.880000 201.152 50 703 1 0 0 1
2 24200 211.9142 91.2 0.307 35574.220 173.5 9.170 9.32 9.115 9.120 7.96 36.247123 108.840000 281.936 42 689 0 0 1 0
3 16600 516.0340 91.2 0.307 29713.695 169.5 9.125 9.31 9.060 9.100 6.96 46.635871 157.391498 301.328 40 677 0 0 1 1
4 17000 850.5840 91.2 0.307 30724.705 158.9 9.050 9.22 9.185 9.330 7.96 22.648871 169.400000 221.360 56 615 1 0 0 0
movies_data.corr()
Collection Marketin_expense Production_expense Multiplex_coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Genre_Comedy Genre_Drama Genre_Thriller 3D_available_YES
Collection 1.000000 -0.409048 -0.484754 0.429300 0.696304 -0.377999 -0.251355 -0.249459 -0.246650 -0.248200 0.341288 0.765323 0.110005 0.023122 -0.047426 -0.391729 -0.077478 0.036233 0.071751 0.182867
Marketin_expense -0.409048 1.000000 0.432125 -0.447478 -0.242900 0.374271 0.402649 0.401933 0.402682 0.398642 -0.191898 -0.395998 0.020817 0.013665 0.071444 0.405228 0.059571 -0.013189 -0.035181 -0.098717
Production_expense -0.484754 0.432125 1.000000 -0.763651 -0.391676 0.644779 0.706481 0.707956 0.707566 0.705819 -0.251565 -0.589393 0.015773 -0.000839 0.055810 0.707559 0.086958 -0.026590 -0.098976 -0.115401
Multiplex_coverage 0.429300 -0.447478 -0.763651 1.000000 0.302188 -0.731470 -0.768589 -0.769724 -0.769157 -0.764873 0.145555 0.565641 0.035515 0.004882 -0.092104 -0.915495 -0.068554 0.046393 0.037772 0.073903
Budget 0.696304 -0.242900 -0.391676 0.302188 1.000000 -0.240265 -0.208464 -0.203981 -0.201907 -0.205397 0.232361 0.621862 0.040439 0.030674 -0.064694 -0.282796 -0.052579 -0.004195 0.046251 0.163774
Movie_length -0.377999 0.374271 0.644779 -0.731470 -0.240265 1.000000 0.746904 0.746493 0.747021 0.746707 -0.217830 -0.597070 -0.019820 0.009380 0.075198 0.673896 0.092693 0.003452 -0.088609 0.005101
Lead_ Actor_Rating -0.251355 0.402649 0.706481 -0.768589 -0.208464 0.746904 1.000000 0.997905 0.997735 0.994073 -0.169978 -0.472630 0.038050 0.014463 0.036794 0.706331 0.044592 -0.035171 -0.030763 -0.025208
Lead_Actress_rating -0.249459 0.401933 0.707956 -0.769724 -0.203981 0.746493 0.997905 1.000000 0.998097 0.994003 -0.165992 -0.471097 0.037975 0.010239 0.038005 0.708257 0.046974 -0.038965 -0.030566 -0.020056
Director_rating -0.246650 0.402682 0.707566 -0.769157 -0.201907 0.747021 0.997735 0.998097 1.000000 0.994126 -0.166638 -0.468861 0.035881 0.010077 0.041470 0.709364 0.046268 -0.033510 -0.033634 -0.020195
Producer_rating -0.248200 0.398642 0.705819 -0.764873 -0.205397 0.746707 0.994073 0.994003 0.994126 1.000000 -0.167003 -0.471498 0.028695 0.005850 0.032542 0.703518 0.051274 -0.031696 -0.033829 -0.020022
Critic_rating 0.341288 -0.191898 -0.251565 0.145555 0.232361 -0.217830 -0.169978 -0.165992 -0.166638 -0.167003 1.000000 0.273364 -0.014762 -0.023655 -0.049797 -0.128769 -0.015253 0.057177 -0.037129 0.039235
Trailer_views 0.765323 -0.395998 -0.589393 0.565641 0.621862 -0.597070 -0.472630 -0.471097 -0.468861 -0.471498 0.273364 1.000000 0.076065 0.025024 -0.039545 -0.532687 -0.109355 0.010627 0.117332 0.093246
Time_taken 0.110005 0.020817 0.015773 0.035515 0.040439 -0.019820 0.038050 0.037975 0.035881 0.028695 -0.014762 0.076065 1.000000 -0.006382 0.072049 -0.056704 0.012908 0.049285 -0.098138 -0.024431
Twitter_hastags 0.023122 0.013665 -0.000839 0.004882 0.030674 0.009380 0.014463 0.010239 0.010077 0.005850 -0.023655 0.025024 -0.006382 1.000000 -0.004840 0.006255 0.034407 0.036442 -0.058431 -0.066012
Avg_age_actors -0.047426 0.071444 0.055810 -0.092104 -0.064694 0.075198 0.036794 0.038005 0.041470 0.032542 -0.049797 -0.039545 0.072049 -0.004840 1.000000 0.078811 -0.030584 -0.015918 -0.036611 -0.013581
Num_multiplex -0.391729 0.405228 0.707559 -0.915495 -0.282796 0.673896 0.706331 0.708257 0.709364 0.703518 -0.128769 -0.532687 -0.056704 0.006255 0.078811 1.000000 0.070720 -0.035126 -0.048863 -0.052262
Genre_Comedy -0.077478 0.059571 0.086958 -0.068554 -0.052579 0.092693 0.044592 0.046974 0.046268 0.051274 -0.015253 -0.109355 0.012908 0.034407 -0.030584 0.070720 1.000000 -0.323621 -0.500192 0.004617
Genre_Drama 0.036233 -0.013189 -0.026590 0.046393 -0.004195 0.003452 -0.035171 -0.038965 -0.033510 -0.031696 0.057177 0.010627 0.049285 0.036442 -0.015918 -0.035126 -0.323621 1.000000 -0.366563 0.035491
Genre_Thriller 0.071751 -0.035181 -0.098976 0.037772 0.046251 -0.088609 -0.030763 -0.030566 -0.033634 -0.033829 -0.037129 0.117332 -0.098138 -0.058431 -0.036611 -0.048863 -0.500192 -0.366563 1.000000 0.017341
3D_available_YES 0.182867 -0.098717 -0.115401 0.073903 0.163774 0.005101 -0.025208 -0.020056 -0.020195 -0.020022 0.039235 0.093246 -0.024431 -0.066012 -0.013581 -0.052262 0.004617 0.035491 0.017341 1.000000
plt.figure(figsize=(18,10))
sns.heatmap(movies_data.corr(),annot=True, cmap = 'viridis')

Following seem to be highly correlated with each other indicating they are not truly independent variables:

  • Num_multiplex and Multiplex coverage
  • Lead_Actress_Rating and Lead_Actor_Rating
  • Director_Rating and Lead_Actor_Rating
  • Producer_Rating and Lead_Actor_Rating

We need to remove one from each pair to avoid the issue of multi-collinearity.

movies_data.corr()['Collection']
Collection             1.000000
Marketin_expense      -0.409048
Production_expense    -0.484754
Multiplex_coverage     0.429300
Budget                 0.696304
Movie_length          -0.377999
Lead_ Actor_Rating    -0.251355
Lead_Actress_rating   -0.249459
Director_rating       -0.246650
Producer_rating       -0.248200
Critic_rating          0.341288
Trailer_views          0.765323
Time_taken             0.110005
Twitter_hastags        0.023122
Avg_age_actors        -0.047426
Num_multiplex         -0.391729
Genre_Comedy          -0.077478
Genre_Drama            0.036233
Genre_Thriller         0.071751
3D_available_YES       0.182867
Name: Collection, dtype: float64
del movies_data['Num_multiplex']
del movies_data['Lead_Actress_rating']
del movies_data['Director_rating']
del movies_data['Producer_rating']

Train Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(movies_data.drop('Collection',axis=1), movies_data['Collection'],
                                                    test_size=0.3, random_state=101)

Train model - Linear regression, Ridge Regression and Lasso Regression

from sklearn.linear_model import LinearRegression, Ridge, Lasso
lm_linear = LinearRegression()
lm_linear.fit(X_train,y_train)
LinearRegression()
pd.DataFrame(lm_linear.coef_,movies_data.columns.drop('Collection'),columns=['Coefficients'])
Coefficients
Marketin_expense -17.660152
Production_expense -38.573837
Multiplex_coverage 21164.321821
Budget 1.592359
Movie_length 7.722789
Lead_ Actor_Rating 4346.690855
Critic_rating 3122.433825
Trailer_views 151.296125
Time_taken 42.416747
Twitter_hastags 0.832483
Avg_age_actors 28.492446
Genre_Comedy 3320.151878
Genre_Drama 3573.406719
Genre_Thriller 3245.247598
3D_available_YES 2526.395364
predict_linear = lm_linear.predict(X_test)
sns.jointplot(x=y_test,y=predict_linear)

We need to standardize data to be used with Ridge and Lasso regression. Also, we need to find an optimum value for the tuning parameter.

from sklearn.model_selection import validation_curve
from sklearn.preprocessing import StandardScaler
#scaling and transforming data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
#Performing cross validation to get best alpha value
param_alpha = np.logspace(-2,8,100)
train_scores, test_scores = validation_curve(Ridge(),X_train_scaled,y_train,"alpha",param_alpha,scoring='r2')
/Users/vanya/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py:68: FutureWarning: Pass param_name=alpha, param_range=[1.00000000e-02 1.26185688e-02 1.59228279e-02 2.00923300e-02
 2.53536449e-02 3.19926714e-02 4.03701726e-02 5.09413801e-02
 6.42807312e-02 8.11130831e-02 1.02353102e-01 1.29154967e-01
 1.62975083e-01 2.05651231e-01 2.59502421e-01 3.27454916e-01
 4.13201240e-01 5.21400829e-01 6.57933225e-01 8.30217568e-01
 1.04761575e+00 1.32194115e+00 1.66810054e+00 2.10490414e+00
 2.65608778e+00 3.35160265e+00 4.22924287e+00 5.33669923e+00
 6.73415066e+00 8.49753436e+00 1.07226722e+01 1.35304777e+01
 1.70735265e+01 2.15443469e+01 2.71858824e+01 3.43046929e+01
 4.32876128e+01 5.46227722e+01 6.89261210e+01 8.69749003e+01
 1.09749877e+02 1.38488637e+02 1.74752840e+02 2.20513074e+02
 2.78255940e+02 3.51119173e+02 4.43062146e+02 5.59081018e+02
 7.05480231e+02 8.90215085e+02 1.12332403e+03 1.41747416e+03
 1.78864953e+03 2.25701972e+03 2.84803587e+03 3.59381366e+03
 4.53487851e+03 5.72236766e+03 7.22080902e+03 9.11162756e+03
 1.14975700e+04 1.45082878e+04 1.83073828e+04 2.31012970e+04
 2.91505306e+04 3.67837977e+04 4.64158883e+04 5.85702082e+04
 7.39072203e+04 9.32603347e+04 1.17681195e+05 1.48496826e+05
 1.87381742e+05 2.36448941e+05 2.98364724e+05 3.76493581e+05
 4.75081016e+05 5.99484250e+05 7.56463328e+05 9.54548457e+05
 1.20450354e+06 1.51991108e+06 1.91791026e+06 2.42012826e+06
 3.05385551e+06 3.85352859e+06 4.86260158e+06 6.13590727e+06
 7.74263683e+06 9.77009957e+06 1.23284674e+07 1.55567614e+07
 1.96304065e+07 2.47707636e+07 3.12571585e+07 3.94420606e+07
 4.97702356e+07 6.28029144e+07 7.92482898e+07 1.00000000e+08] as keyword args. From version 0.25 passing these as positional arguments will result in an error
  warnings.warn("Pass {} as keyword args. From version 0.25 "
test_mean = test_scores.mean(axis=1)
test_mean
array([ 0.66295654,  0.66295759,  0.66295891,  0.66296058,  0.66296268,
        0.66296533,  0.66296868,  0.66297289,  0.66297821,  0.66298491,
        0.66299335,  0.66300397,  0.66301734,  0.66303415,  0.66305528,
        0.66308179,  0.66311501,  0.66315656,  0.66320842,  0.66327294,
        0.66335292,  0.66345159,  0.66357257,  0.66371973,  0.66389691,
        0.66410742,  0.6643531 ,  0.66463302,  0.66494138,  0.66526465,
        0.6655776 ,  0.66583833,  0.66598227,  0.66591539,  0.6655071 ,
        0.66458335,  0.66292045,  0.66024053,  0.65620961,  0.65044075,
        0.64250523,  0.63195527,  0.61835943,  0.60134782,  0.5806585 ,
        0.55617433,  0.52794199,  0.49617368,  0.46123997,  0.42366408,
        0.38411958,  0.34342169,  0.30249646,  0.26231859,  0.22382464,
        0.18782195,  0.15491768,  0.1254839 ,  0.09966124,  0.07739249,
        0.05847218,  0.04259934,  0.02942417,  0.0185845 ,  0.009731  ,
        0.0025426 , -0.00326576, -0.00794076, -0.01169177, -0.01469385,
       -0.01709169, -0.01900384, -0.02052671, -0.02173833, -0.02270153,
       -0.02346674, -0.02407435, -0.02455663, -0.02493929, -0.02524285,
       -0.0254836 , -0.02567451, -0.02582587, -0.02594587, -0.026041  ,
       -0.02611641, -0.02617618, -0.02622355, -0.0262611 , -0.02629086,
       -0.02631444, -0.02633313, -0.02634795, -0.02635969, -0.02636899,
       -0.02637636, -0.02638221, -0.02638684, -0.02639051, -0.02639342])

Best value for alpha from our range will be with max R2 value.

np.where(test_mean == test_mean.max())[0][0]
32
lm_ridge = Ridge(alpha=param_alpha[32])
lm_ridge.fit(X_train_scaled,y_train)
predict_ridge = lm_ridge.predict(X_test_scaled)
sns.jointplot(x=y_test,y=predict_ridge)

train_scores, test_scores = validation_curve(Lasso(),X_train_scaled,y_train,"alpha",param_alpha,scoring='r2')
/Users/vanya/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py:68: FutureWarning: Pass param_name=alpha, param_range=[1.00000000e-02 1.26185688e-02 1.59228279e-02 2.00923300e-02
 2.53536449e-02 3.19926714e-02 4.03701726e-02 5.09413801e-02
 6.42807312e-02 8.11130831e-02 1.02353102e-01 1.29154967e-01
 1.62975083e-01 2.05651231e-01 2.59502421e-01 3.27454916e-01
 4.13201240e-01 5.21400829e-01 6.57933225e-01 8.30217568e-01
 1.04761575e+00 1.32194115e+00 1.66810054e+00 2.10490414e+00
 2.65608778e+00 3.35160265e+00 4.22924287e+00 5.33669923e+00
 6.73415066e+00 8.49753436e+00 1.07226722e+01 1.35304777e+01
 1.70735265e+01 2.15443469e+01 2.71858824e+01 3.43046929e+01
 4.32876128e+01 5.46227722e+01 6.89261210e+01 8.69749003e+01
 1.09749877e+02 1.38488637e+02 1.74752840e+02 2.20513074e+02
 2.78255940e+02 3.51119173e+02 4.43062146e+02 5.59081018e+02
 7.05480231e+02 8.90215085e+02 1.12332403e+03 1.41747416e+03
 1.78864953e+03 2.25701972e+03 2.84803587e+03 3.59381366e+03
 4.53487851e+03 5.72236766e+03 7.22080902e+03 9.11162756e+03
 1.14975700e+04 1.45082878e+04 1.83073828e+04 2.31012970e+04
 2.91505306e+04 3.67837977e+04 4.64158883e+04 5.85702082e+04
 7.39072203e+04 9.32603347e+04 1.17681195e+05 1.48496826e+05
 1.87381742e+05 2.36448941e+05 2.98364724e+05 3.76493581e+05
 4.75081016e+05 5.99484250e+05 7.56463328e+05 9.54548457e+05
 1.20450354e+06 1.51991108e+06 1.91791026e+06 2.42012826e+06
 3.05385551e+06 3.85352859e+06 4.86260158e+06 6.13590727e+06
 7.74263683e+06 9.77009957e+06 1.23284674e+07 1.55567614e+07
 1.96304065e+07 2.47707636e+07 3.12571585e+07 3.94420606e+07
 4.97702356e+07 6.28029144e+07 7.92482898e+07 1.00000000e+08] as keyword args. From version 0.25 passing these as positional arguments will result in an error
  warnings.warn("Pass {} as keyword args. From version 0.25 "
test_mean = test_scores.mean(axis=1)
lm_lasso = Lasso(alpha=param_alpha[np.where(test_mean==test_mean.max())[0][0]])
lm_lasso.fit(X_train_scaled,y_train)
predict_lasso = lm_lasso.predict(X_test_scaled)
sns.jointplot(x=y_test,y=predict_lasso)

Evaluating the model

from sklearn.metrics import r2_score, mean_squared_error
print("R2 Score --> higher is better")
print("Linear:", r2_score(y_test,predict_linear))
print("Ridge:", r2_score(y_test,predict_ridge))
print("Lasso:", r2_score(y_test,predict_lasso))
R2 Score --> higher is better
Linear: 0.7468007748323722
Ridge: 0.7569419920394107
Lasso: 0.7571808357758425
print("Root mean square error --> lower is better")
print("Linear:", np.sqrt(mean_squared_error(y_test,predict_linear)))
print("Ridge:", np.sqrt(mean_squared_error(y_test,predict_ridge)))
print("Lasso:", np.sqrt(mean_squared_error(y_test,predict_lasso)))
Root mean square error --> lower is better
Linear: 9194.62359790369
Ridge: 9008.608966959373
Lasso: 9004.181672649082

Result: We were able to train our model using the data avaliable to determine movie collection using Linear, Ridge and Lasso regression techniques. As per the result, all three are quite close in R2 score and RMSE with Lasso being the best. Owing to the small size of the dataset, the results are very close.