Determine neighbourhood to open new restaurant using clustering

Objective: Determine the neighbourhood to open a new restaurant in order to expand business.

We want to open a new restaurant in New York similar to the one we have in San Francisco. First, we need to shortlist a few neighbourhoods where the new restaurant could go. We’ll use K-Means clustering to find the New York neighbourhoods most similar to our current location in terms of nearby venues. The venue data comes from the Foursquare API, and the list of New York City neighbourhoods is obtained by web scraping.

The idea behind this analysis extends to many other decisions, such as where to open a new office or play centre, or where to buy a house.

#importing libraries to be used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#view plots in jupyter notebook
%matplotlib inline
sns.set_style('whitegrid') #setting style for plots, optional

# Library for geocoding locations (address to latitude/longitude)
from geopy.geocoders import Nominatim

# Library for using Kmeans method for clustering
from sklearn.cluster import KMeans

# Libraries to handle requests
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0

# Libraries to plot and visualize locations on maps and also plotting other kmeans related data
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

# Library to import data from a website (web scraping)
from bs4 import BeautifulSoup as BS

Putting in details of current restaurant location in San Francisco

SF_restaurant = "Octavia St, San Francisco, CA 94102, United States"

# Getting the latitude and longitude of the restaurant
geolocator = Nominatim(user_agent="USA_explorer")
SF_res_location = geolocator.geocode(SF_restaurant)
SF_latitude = SF_res_location.latitude
SF_longitude = SF_res_location.longitude
print('The geographical coordinates are {}, {}.'.format(SF_latitude, SF_longitude))
The geographical coordinates are 37.7780777, -122.424924.

Populating New York City Neighbourhood information from: https://www.baruch.cuny.edu/nycdata/population-geography/neighborhoods.htm

URL = "https://www.baruch.cuny.edu/nycdata/population-geography/neighborhoods.htm"
r = requests.get(URL, verify = False)  # note: verify=False skips TLS certificate verification

soup = BS(r.text, "html.parser")
data = soup.find_all("tr")

# Locate the header row: the first table row with a cell reading "Brooklyn"
start_index = 0
for i in range (len(data)):
    td = data[i].find_all("td")
    for j in range (len(td)):
        if td[j].text == "Brooklyn":
            start_index = i
            break
    if start_index != 0:
        break

# Locate the last data row by scanning upward for a cell reading "Woodside"
end_index = 0
for i in range (len(data)-1,0,-1):
    td = data[i].find_all("td")
    for j in range (len(td)):
        if td[j].text == "Woodside":
            end_index = i
            break
    if end_index != 0:
        break

# Collect cells 1 to 5 of every row between the two indices, one list per borough column
list1 = []
list2 = []
list3 = []
list4 = []
list5 = []
for i in range (start_index,end_index+1):
    td = data[i].find_all("td")
    list1.append(td[1].text)
    list2.append(td[2].text)
    list3.append(td[3].text)
    list4.append(td[4].text)
    list5.append(td[5].text)

final = []
final.append(list1)
final.append(list2)
final.append(list3)
final.append(list4)
final.append(list5)

df = pd.DataFrame(final)

df = df.transpose()


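# Build (Borough, Neighbourhood) rows; DataFrame.append was removed in pandas 2.0,
# so the rows are collected in a list and converted once at the end
rows = []

for i in range (5):
    borough = df[i][0]  # the first row of each column holds the borough name
    for j in range (1,len(df)):
        if df[i][j]=='\xa0':  # a non-breaking space marks the end of the column
            break
        rows.append({'Borough':borough,'Neighbourhood':df[i][j]})

final_df = pd.DataFrame(rows, columns=['Borough','Neighbourhood'])
final_df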
Borough Neighbourhood
0 Brooklyn Bath Beach
1 Brooklyn Bay Ridge
2 Brooklyn Bedford Stuyvesant
3 Brooklyn Bensonhurst
4 Brooklyn Bergen Beach
... ... ...
324 Staten Island Ward Hill
325 Staten Island West Brighton
326 Staten Island Westerleigh
327 Staten Island Willowbrook
328 Staten Island Woodrow

329 rows × 2 columns
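
The reshaping step above turns a table with one column per borough into one row per neighbourhood. As a rough sketch of the same idea on toy data (the three-row table below is a made-up stand-in for the scraped table, not the real page), pandas `melt` can do the wide-to-long conversion. Note that it filters out every padding cell, whereas the loop above stops at the first one; the two agree only if padding appears at the bottom of a column.

```python
import pandas as pd

# Toy stand-in for the scraped table (hypothetical data): the first row holds
# borough names, the rows below hold neighbourhoods, '\xa0' pads short columns.
raw = pd.DataFrame([
    ["Brooklyn", "Queens"],
    ["Bath Beach", "Astoria"],
    ["Bay Ridge", "\xa0"],
])

wide = raw.iloc[1:].copy()
wide.columns = list(raw.iloc[0])  # promote the borough names to column labels

# One row per (Borough, Neighbourhood) pair, then drop the padding cells
long_df = wide.melt(var_name="Borough", value_name="Neighbourhood")
long_df = long_df[long_df["Neighbourhood"] != "\xa0"].reset_index(drop=True)
print(long_df)
```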

Adding latitude and longitude information for each neighbourhood

final_df['Latitude']=""
final_df['Longitude']=""

geolocator = Nominatim(user_agent="USA_explorer")  # create the geocoder once, outside the loop

for i in range(len(final_df)):
    nyadd=str(final_df['Neighbourhood'][i])+', '+str(final_df['Borough'][i])+', New York'

    location = geolocator.geocode(nyadd)
    try:
        latitude = location.latitude
        longitude = location.longitude
    except AttributeError:  # geocode() returned None: the address was not found
        latitude=1000 # Sentinel for neighbourhoods whose latitude and longitude could not be fetched
        longitude=1000
    # .loc avoids chained assignment, which may silently write to a copy
    final_df.loc[i,'Latitude']=latitude
    final_df.loc[i,'Longitude']=longitude

final_df
Borough Neighbourhood Latitude Longitude
0 Brooklyn Bath Beach 40.6018 -74.0005
1 Brooklyn Bay Ridge 40.634 -74.0146
2 Brooklyn Bedford Stuyvesant 40.6834 -73.9412
3 Brooklyn Bensonhurst 40.605 -73.9934
4 Brooklyn Bergen Beach 40.6204 -73.9068
... ... ... ... ...
324 Staten Island Ward Hill 40.6329 -74.0829
325 Staten Island West Brighton 1000 1000
326 Staten Island Westerleigh 40.6212 -74.1318
327 Staten Island Willowbrook 40.6032 -74.1385
328 Staten Island Woodrow 40.5434 -74.1976

329 rows × 4 columns

Cleaning the dataset fetched from the external URL

# Drop the rows that could not be geocoded (sentinel value 1000) and renumber the index
final_df=final_df[final_df.Latitude!=1000].reset_index(drop=True)
final_df
Borough Neighbourhood Latitude Longitude
0 Brooklyn Bath Beach 40.6018 -74.0005
1 Brooklyn Bay Ridge 40.634 -74.0146
2 Brooklyn Bedford Stuyvesant 40.6834 -73.9412
3 Brooklyn Bensonhurst 40.605 -73.9934
4 Brooklyn Bergen Beach 40.6204 -73.9068
... ... ... ... ...
311 Staten Island Travis 40.5932 -74.1879
312 Staten Island Ward Hill 40.6329 -74.0829
313 Staten Island Westerleigh 40.6212 -74.1318
314 Staten Island Willowbrook 40.6032 -74.1385
315 Staten Island Woodrow 40.5434 -74.1976

316 rows × 4 columns
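
The sentinel value 1000 does the job here, but a missing-value marker is a common alternative for this kind of cleaning. The sketch below is not what the notebook does: it assumes failed lookups were stored as NaN instead of 1000, and uses three rows copied from the table above as sample data.

```python
import numpy as np
import pandas as pd

# Three rows taken from the output above; West Brighton could not be geocoded,
# so its coordinates are stored as NaN here (an assumption) rather than 1000.
coords = pd.DataFrame({
    "Neighbourhood": ["Ward Hill", "West Brighton", "Westerleigh"],
    "Latitude": [40.6329, np.nan, 40.6212],
    "Longitude": [-74.0829, np.nan, -74.1318],
})

# dropna removes rows with a missing coordinate; reset_index renumbers the rest
cleaned = coords.dropna(subset=["Latitude", "Longitude"]).reset_index(drop=True)
print(cleaned)
```

With NaN, the coordinate columns keep a numeric dtype and an out-of-range value can never slip into a map or a distance calculation by accident.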

Adding in the location of the current restaurant (in San Francisco) so that it is also used in clustering along with NYC neighbourhoods

SF_rest_add={'Borough': 'Hayes Valley, SF','Neighbourhood':'Hayes Valley','Latitude':SF_latitude,'Longitude':SF_longitude}
final_df=pd.concat([final_df,pd.DataFrame([SF_rest_add])],ignore_index=True)  # DataFrame.append was removed in pandas 2.0
final_df.iloc[[-1]]
Borough Neighbourhood Latitude Longitude
316 Hayes Valley, SF Hayes Valley 37.7781 -122.425

Clustering the neighbourhoods of New York including the neighbourhood of San Francisco

Defining FourSquare credentials

CLIENT_ID = '*****************************************'
CLIENT_SECRET = '**************************************'
VERSION = '20180605'
LIMIT = 100

Defining a function to get the venues from all neighbourhoods

def getNearbyVenues(borough, names, latitudes, longitudes, radius=500):

    venues_list=[]
    for borough, name, lat, lng in zip(borough, names, latitudes, longitudes):

        # API request URL creation
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # making requests for the URL
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # Returning only relevant information for each nearby venue
        venues_list.append([(
            borough,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough','Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return(nearby_venues)
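
The request URL in the function above is assembled with string formatting. As a small sketch of an alternative, the query string can be built from a dictionary with the standard library, which takes care of escaping. The helper name `build_explore_url` is hypothetical; the endpoint and parameter names are the ones already used in the function above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Hypothetical helper: build the explore URL used in getNearbyVenues."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude and longitude as one comma-separated value
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-escapes each value and joins the pairs with '&'
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url("ID", "SECRET", "20180605", 40.60185, -74.000501)
print(url)
```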
# Getting the venues for each neighbourhood
NewYork_venues = getNearbyVenues(borough=final_df['Borough'],names=final_df['Neighbourhood'],
                                   latitudes=final_df['Latitude'],
                                   longitudes=final_df['Longitude']
                                  )
# Looking at the data received from FourSquare
NewYork_venues.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10720 entries, 0 to 10719
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Borough                 10720 non-null  object
 1   Neighborhood            10720 non-null  object
 2   Neighborhood Latitude   10720 non-null  float64
 3   Neighborhood Longitude  10720 non-null  float64
 4   Venue                   10720 non-null  object
 5   Venue Latitude          10720 non-null  float64
 6   Venue Longitude         10720 non-null  float64
 7   Venue Category          10720 non-null  object
dtypes: float64(4), object(4)
memory usage: 670.1+ KB
NewYork_venues.head()
Borough Neighborhood Neighborhood Latitude Neighborhood Longitude Venue Venue Latitude Venue Longitude Venue Category
0 Brooklyn Bath Beach 40.60185 -74.000501 Lenny's Pizza 40.604908 -73.998713 Pizza Place
1 Brooklyn Bath Beach 40.60185 -74.000501 King's Kitchen 40.603844 -73.996960 Cantonese Restaurant
2 Brooklyn Bath Beach 40.60185 -74.000501 Delacqua 40.604216 -73.997452 Spa
3 Brooklyn Bath Beach 40.60185 -74.000501 Lutzina Bar&Lounge 40.600807 -74.000578 Hookah Bar
4 Brooklyn Bath Beach 40.60185 -74.000501 Planet Fitness 40.604567 -73.997861 Gym / Fitness Center
NewYork_venues["Venue Category"].unique()
array(['Pizza Place', 'Cantonese Restaurant', 'Spa', 'Hookah Bar',
       'Gym / Fitness Center', 'Dessert Shop', 'Chinese Restaurant',
       'Bakery', 'Italian Restaurant', 'Coffee Shop', 'Restaurant',
       'Japanese Restaurant', 'Supplement Shop',
       'Eastern European Restaurant', 'Dim Sum Restaurant', 'Tea Room',
       'Ice Cream Shop', 'Peruvian Restaurant', 'Sandwich Place', 'Bank',
       'American Restaurant', 'Shanghai Restaurant', 'Mobile Phone Shop',
       'Kids Store', 'Gas Station', 'Middle Eastern Restaurant',
       'Seafood Restaurant', 'Tennis Court', 'Vietnamese Restaurant',
       'Noodle House', 'Rental Car Location', 'Park',
       'Fried Chicken Joint', 'Hotpot Restaurant', 'Gift Shop',
       'Irish Pub', 'Malay Restaurant', 'Bar', 'Playground', 'Donut Shop',
       'Bubble Tea Shop', 'Nightclub', 'Cocktail Bar',
       'New American Restaurant', 'Wine Shop', 'Boutique', 'Tiki Bar',
       'Café', 'Taco Place', 'Mexican Restaurant', 'Gym', 'Wine Bar',
       'Gourmet Shop', 'Bagel Shop', 'Lounge', 'Food', 'Deli / Bodega',
       'Thrift / Vintage Store', 'Garden',
       'Southern / Soul Food Restaurant', 'Caribbean Restaurant',
       'Discount Store', 'Pharmacy', 'Farmers Market', 'Cosmetics Shop',
       'Turkish Restaurant', 'Sushi Restaurant', 'Fast Food Restaurant',
       'Salon / Barbershop', 'Video Game Store', 'Frozen Yogurt Shop',
       'Asian Restaurant', 'Shoe Store', 'Clothing Store',
       "Women's Store", 'Optical Shop', 'Accessories Store',
       'Supermarket', 'Bus Station', 'Furniture / Home Store',
       'Concert Hall', 'Antique Shop', 'Yoga Studio',
       'Martial Arts School', 'Grocery Store', 'Indian Restaurant',
       'Athletics & Sports', 'Burrito Place', 'Music Venue',
       'Jewelry Store', 'French Restaurant', 'Thai Restaurant',
       'Flower Shop', 'Dance Studio', 'Bookstore', "Men's Store",
       'Theater', 'Korean Restaurant', 'Music Store',
       'Health & Beauty Service', 'Cajun / Creole Restaurant',
       'Garden Center', 'Arts & Crafts Store', 'Boxing Gym',
       'Electronics Store', 'Dry Cleaner', 'Gastropub', 'Historic Site',
       'Juice Bar', 'Burger Joint', 'Convenience Store',
       'Bed & Breakfast', 'Hotel', 'Bistro', 'Bike Shop', 'Neighborhood',
       'Russian Restaurant', 'Mediterranean Restaurant',
       'Food & Drink Shop', 'Other Great Outdoors', 'Food Truck',
       'Non-Profit', 'Diner', 'Varenyky restaurant', 'Karaoke Bar',
       'Pool', 'Bus Line', 'Tunnel', 'Recording Studio', 'History Museum',
       'Pet Store', 'Scenic Lookout', 'Beach', 'Falafel Restaurant',
       'Pier', 'Indie Theater', 'Pilates Studio', 'Ramen Restaurant',
       'Pub', 'Plaza', 'Chocolate Shop', 'Mattress Store',
       'Spanish Restaurant', 'Moving Target', 'Vape Store',
       'Fruit & Vegetable Store', 'Liquor Store', 'Lawyer',
       'Metro Station', 'Bus Stop', 'Greek Restaurant', 'Record Shop',
       'Beer Garden', 'Butcher', 'Event Space', 'Gaming Cafe',
       'Herbs & Spices Store', 'Church', 'Filipino Restaurant',
       'Latin American Restaurant', 'Wings Joint', 'Breakfast Spot',
       'Brewery', 'Fish Market', 'Art Gallery', 'African Restaurant',
       'Photography Studio', 'Sculpture Garden',
       'Vegetarian / Vegan Restaurant', 'Pie Shop', 'Market',
       'Waterfront', 'Ethiopian Restaurant', 'Yemeni Restaurant',
       'Dumpling Restaurant', 'Indie Movie Theater',
       'Sporting Goods Shop', 'Toy / Game Store', 'Speakeasy', 'Dive Bar',
       'Harbor / Marina', 'Basketball Court', 'Candy Store', 'Museum',
       'Drugstore', 'Department Store', 'Gun Range', 'Tibetan Restaurant',
       'Health Food Store', 'Nail Salon', 'Tapas Restaurant',
       'Salad Place', 'Climbing Gym', 'Theme Park Ride / Attraction',
       'Roof Deck', 'Dog Run', 'Food Court', 'Trail', 'Boat or Ferry',
       'Performing Arts Venue', 'Entertainment Service', 'Hotel Bar',
       'Intersection', 'Poke Place', 'Moroccan Restaurant',
       'Massage Studio', 'Flea Market', 'Cycle Studio', 'Perfume Shop',
       'Residential Building (Apartment / Condo)', 'Whisky Bar', 'Winery',
       'Factory', 'Miscellaneous Shop', 'Hobby Shop', 'High School',
       'Rental Service', 'BBQ Joint', 'Israeli Restaurant', 'Opera House',
       'German Restaurant', 'Beer Bar', 'Cupcake Shop', 'Steakhouse',
       'Shipping Store', 'Board Shop', 'Bridge', 'Shopping Mall',
       'Child Care Service', 'Skate Park', 'Soccer Field',
       'Cuban Restaurant', 'Indoor Play Area', 'Baseball Field',
       "Doctor's Office", 'Jewish Restaurant', 'Polish Restaurant',
       'Sports Bar', 'Track', 'Cheese Shop', 'Bowling Alley',
       'Laundromat', 'Austrian Restaurant', 'Organic Grocery', 'Farm',
       'Gymnastics Gym', 'Halal Restaurant', 'Big Box Store',
       'Kosher Restaurant', 'Tourist Information Center', 'Film Studio',
       'IT Services', 'School', 'Comic Shop', 'Gym Pool',
       'Colombian Restaurant', 'Soup Place', 'Used Bookstore',
       'Business Service', 'North Indian Restaurant', 'Other Nightlife',
       'Public Art', 'Field', 'Picnic Shelter', 'Waterfall',
       'Amphitheater', 'Bike Trail', 'Hill', 'Snack Place', 'Sports Club',
       'Video Store', 'Paper / Office Supplies Store', 'Lake',
       'General Travel', 'Comfort Food Restaurant', 'Creperie',
       'Szechuan Restaurant', 'Stadium', 'Community Center',
       'Arepa Restaurant', 'Brazilian Restaurant', 'Football Stadium',
       'Laundry Service', 'Theme Park', 'Aquarium', 'Exhibit', 'Arcade',
       'Movie Theater', 'Beer Store', 'Udon Restaurant',
       'South American Restaurant', 'Hardware Store', 'Gay Bar',
       'Outdoor Gym', 'Picnic Area', 'Storage Facility', 'Tattoo Parlor',
       'Smoke Shop', 'Piano Bar', 'Train Station', 'Print Shop',
       'Pool Hall', 'Zoo', 'Zoo Exhibit', 'Souvenir Shop',
       'Warehouse Store', 'Check Cashing Service', 'Post Office',
       'Jazz Club', 'Puerto Rican Restaurant', 'Eye Doctor', 'River',
       'Outlet Store', 'Waste Facility', 'Tennis Stadium', 'Canal',
       'Recreation Center', 'Social Club', 'Library', 'Shop & Service',
       'Distillery', 'Home Service', 'Auto Dealership',
       'Construction & Landscaping', 'Outdoors & Recreation', 'Building',
       'Cooking School', 'Memorial Site', 'Auditorium', 'Tree',
       'Lingerie Store', 'Monument / Landmark', 'Paella Restaurant',
       'Japanese Curry Restaurant', 'Lebanese Restaurant',
       'Peking Duck Restaurant', 'Art Museum', 'Smoothie Shop',
       'Argentinian Restaurant', 'Comedy Club', 'Cha Chaan Teng',
       'Taiwanese Restaurant', 'Sake Bar', 'Food Stand', 'Animal Shelter',
       'Molecular Gastronomy Restaurant', 'Medical Center',
       'Golf Driving Range', 'Outdoor Sculpture', 'Ukrainian Restaurant',
       'Soba Restaurant', 'Shabu-Shabu Restaurant',
       'Australian Restaurant', 'Coworking Space', 'Kebab Restaurant',
       'General Entertainment', 'Office', 'Tex-Mex Restaurant',
       'Fountain', 'Stationery Store', 'Adult Boutique',
       'Leather Goods Store', 'Golf Course', 'Fondue Restaurant',
       'Theme Restaurant', 'Veterinarian', 'Empanada Restaurant',
       'College Academic Building', 'Czech Restaurant', 'Club House',
       'Bridal Shop', 'Shoe Repair', 'College Arts Building', 'Circus',
       'College Bookstore', 'Kitchen Supply Store', 'Newsstand',
       'Pet Service', 'Hostel', 'Hawaiian Restaurant', 'College Theater',
       'Churrascaria', 'Skating Rink', 'Luggage Store',
       'College Cafeteria', 'Cultural Center', 'Resort', 'Watch Shop',
       'Outdoor Supply Store', 'Street Art', 'Duty-free Shop',
       'Scandinavian Restaurant', 'Pet Café', 'Swiss Restaurant',
       'Tram Station', 'Persian Restaurant', 'Bike Rental / Bike Share',
       'Tailor Shop', 'Pedestrian Plaza', 'Hot Dog Joint', 'Daycare',
       'Tanning Salon', 'Train', 'Surf Spot', 'Parking',
       'Indonesian Restaurant', 'Venezuelan Restaurant',
       'Imported Food Shop', 'Bath House', 'Fish & Chips Shop',
       'Afghan Restaurant', 'Automotive Shop', 'Beach Bar', 'Pop-Up Shop',
       'Sri Lankan Restaurant', 'Portuguese Restaurant', 'Rest Area',
       'Rock Club', 'Costume Shop', 'Government Building',
       'Airport Lounge', 'Airport Terminal', 'Airport Food Court',
       'Plane', 'Motorcycle Shop', 'Rock Climbing Spot', 'Cafeteria',
       'Auto Garage', 'Romanian Restaurant', 'Go Kart Track',
       'Professional & Other Places', 'Racetrack', 'Fishing Spot',
       'Lighthouse', 'Nightlife Spot', 'Weight Loss Center', 'Buffet',
       'Toll Plaza', 'Botanical Garden', 'Baseball Stadium',
       'Outlet Mall', 'Souvlaki Shop', 'Camera Store'], dtype=object)
NewYork_venues["Venue Category"].nunique()
443

We get 443 unique venue categories from the Foursquare data.

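# Drop venues whose category is literally 'Neighborhood'; with an empty prefix it would
# otherwise produce a one-hot column that duplicates the existing Neighborhood column
NewYork_venues=NewYork_venues[NewYork_venues['Venue Category']!='Neighborhood']
# One-hot encoding to handle categorical data for clustering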
NY_onehot = pd.get_dummies(data=NewYork_venues[['Borough','Neighborhood','Venue Category']],columns=['Venue Category'],drop_first=True,prefix="", prefix_sep="")
NY_onehot.head()
Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant Amphitheater ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
0 Brooklyn Bath Beach 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Brooklyn Bath Beach 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Brooklyn Bath Beach 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Brooklyn Bath Beach 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 Brooklyn Bath Beach 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 443 columns
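
One detail of the encoding step is worth illustrating: `drop_first=True` removes the first category column in sorted order. In the output above, the first venue column is Adult Boutique, while Accessories Store, which appears in the category list, is absent. The toy example below (made-up rows, not the Foursquare data) shows the effect side by side.

```python
import pandas as pd

# Toy venue table (hypothetical rows) with three categories
venues = pd.DataFrame({
    "Neighborhood": ["Bath Beach", "Bath Beach", "Bay Ridge"],
    "Venue Category": ["Pizza Place", "Spa", "Bakery"],
})

# Full encoding: one column per category
full = pd.get_dummies(venues, columns=["Venue Category"],
                      prefix="", prefix_sep="")

# drop_first=True omits the first category in sorted order ('Bakery' here)
reduced = pd.get_dummies(venues, columns=["Venue Category"],
                         prefix="", prefix_sep="", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level is usually done to avoid collinearity in regression models. For distance-based clustering it simply means that the dropped category no longer contributes to the similarity between neighbourhoods.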

Scoring each category by the mean frequency of its occurrence within each neighbourhood. This helps determine the similarities between neighbourhoods.

NY_grouped = NY_onehot.groupby(['Borough','Neighborhood']).mean().reset_index()
NY_grouped.head()
Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant Amphitheater ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
0 Bronx Allerton 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Bronx Bathgate 0.0 0.0 0.0 0.0 0.0 0.0 0.010000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Bronx Baychester 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Bronx Bedford Park 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Bronx Belmont 0.0 0.0 0.0 0.0 0.0 0.0 0.017241 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 443 columns
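
To see why the mean works as a frequency score, here is a minimal sketch on made-up data: each row is one venue, so averaging the 0/1 indicator columns per neighbourhood gives the share of venues in each category. The neighbourhood names "A" and "B" are placeholders.

```python
import pandas as pd

# Toy one-hot table (hypothetical): one row per venue, 1 marks its category
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Bakery":       [1, 0, 0, 0, 0, 1],
    "Pizza Place":  [0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1,
# i.e. the relative frequency of that category in the neighbourhood
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Because each venue falls in exactly one category, every row of the result sums to 1, which makes neighbourhoods with different venue counts comparable.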

Defining a function to determine the most common venues in each neighbourhood. Limiting the analysis to the top venue categories reduces noise in the data and gives more precise results.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]
num_top_venues = 15 # selecting top 15 venues for our analysis

indicators = ['st', 'nd', 'rd']

# creating columns according to number of top venues
columns = ['Borough','Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # past 'st', 'nd', 'rd' every label uses 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# new dataframe to hold the top 15 venues for each of the neighbourhoods
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Borough']=NY_grouped['Borough']
neighborhoods_venues_sorted['Neighborhood'] = NY_grouped['Neighborhood']

# calling the function to get the top 15 venues for each neighbourhood
for ind in np.arange(NY_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(NY_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
Borough Neighborhood 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue 11th Most Common Venue 12th Most Common Venue 13th Most Common Venue 14th Most Common Venue 15th Most Common Venue
0 Bronx Allerton Discount Store Sandwich Place Fast Food Restaurant Pizza Place Pharmacy Donut Shop Storage Facility Bike Trail Soccer Field Seafood Restaurant Bar Bank Clothing Store Mobile Phone Shop Trail
1 Bronx Bathgate Italian Restaurant Pizza Place Deli / Bodega Spanish Restaurant Liquor Store Bank Bakery Grocery Store Dessert Shop Mexican Restaurant Sandwich Place Food & Drink Shop Coffee Shop Shoe Store Donut Shop
2 Bronx Baychester Pharmacy Italian Restaurant Grocery Store Bike Trail Liquor Store Historic Site Pizza Place Sandwich Place Donut Shop Print Shop Mobile Phone Shop Bus Station Bus Line Deli / Bodega Playground
3 Bronx Bedford Park Diner Pizza Place Deli / Bodega Mexican Restaurant Supermarket Pharmacy Chinese Restaurant Sandwich Place Spanish Restaurant Bus Station Grocery Store Food Truck Smoke Shop Baseball Field Bakery
4 Bronx Belmont Italian Restaurant Pizza Place Bakery Deli / Bodega Dessert Shop Restaurant Fish Market Food & Drink Shop Cheese Shop Chinese Restaurant Tattoo Parlor Grocery Store Mexican Restaurant Fast Food Restaurant Mediterranean Restaurant
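
The column labels above come from a list of three suffixes with a fallback to "th". That is fine for 15 columns, but it would label a 21st column "21th". If more columns were ever needed, a small helper along these lines could generate the labels; `ordinal` is a hypothetical name and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th (and 111th, 112th, ...) are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Build the same style of column labels as above
labels = [f"{ordinal(i)} Most Common Venue" for i in range(1, 16)]
print(labels[:4])
```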

Determining the K value (using elbow method)

K=range(1,25)

NY_grouped_clustering = NY_grouped.drop(['Borough','Neighborhood'], axis=1)  # axis must be passed by keyword in pandas 2.0+
WCSS=[] # Model performance indicator --- Within Cluster Sum of Squares

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(NY_grouped_clustering)
    WCSS.append(kmeans.inertia_)

print (WCSS)
[30.047043953503767, 28.142337991544935, 26.84013025212907, 25.956055283198904, 24.8308555411957, 24.19646209527639, 23.195158562596642, 22.935235144735433, 22.330608681361483, 22.097341077068755, 21.759179057646183, 21.06650654574184, 20.813200614316383, 20.438676231526742, 19.913500964265413, 19.51207544053955, 19.414458087213852, 19.1960151544852, 18.751570710040756, 18.449575691630375, 18.142105128663783, 17.802439692856836, 17.593415639791107, 17.267414917310226]

We use the within-cluster sum of squares (inertia) to determine the best value of k for our analysis.

#Plotting the graph of K vs WCSS to determine "k"
plt.figure(figsize=(20,10))
plt.plot(K,WCSS)
plt.xlabel("k")
plt.ylabel("Sum of Squares")
plt.title("Determining K-value")
plt.show()

We’ll use k = 7 based on the graph above. This looks like the closest thing to an elbow, though the curve has no clear-cut bend.
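
When the elbow is hard to see, one rough check is to look at how much the WCSS falls with each additional cluster. The sketch below uses the first ten WCSS values printed above, rounded to two decimals, and lists the successive decreases. It is only a reading aid for the plot, not a formal selection rule.

```python
# First ten WCSS values from the printed list above, rounded to two decimals
wcss = [30.05, 28.14, 26.84, 25.96, 24.83, 24.20, 23.20, 22.94, 22.33, 22.10]

# drops[i] is the reduction in WCSS when going from k = i + 1 to k = i + 2
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]

for k, drop in enumerate(drops, start=2):
    print(f"k={k}: WCSS fell by {drop:.2f} compared with k={k - 1}")
```

In these numbers the step from k = 6 to k = 7 still reduces the WCSS by about 1.0, while the step from 7 to 8 reduces it by only about 0.26. That is consistent with choosing 7, although later steps are not uniformly small, which matches the lack of a sharp elbow.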

Running the clustering algorithm

k = 7
kmeans = KMeans(n_clusters=k, random_state=0).fit(NY_grouped_clustering)
# adding clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

NY_clustered = final_df

# Adding latitude/longitude for each neighborhood with the cluster labels
NY_clustered = NY_clustered.merge(neighborhoods_venues_sorted.set_index(['Borough','Neighborhood']), left_on=['Borough','Neighbourhood'],right_on=['Borough','Neighborhood'])

NY_clustered.head()
Borough Neighbourhood Latitude Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue 11th Most Common Venue 12th Most Common Venue 13th Most Common Venue 14th Most Common Venue 15th Most Common Venue
0 Brooklyn Bath Beach 40.6018 -74.0005 1 Chinese Restaurant Cantonese Restaurant Supplement Shop Pizza Place Bank Italian Restaurant Japanese Restaurant Dessert Shop Gas Station Bakery Tea Room Sandwich Place Eastern European Restaurant Peruvian Restaurant Middle Eastern Restaurant
1 Brooklyn Bay Ridge 40.634 -74.0146 2 Chinese Restaurant Dessert Shop Seafood Restaurant Playground Irish Pub Vietnamese Restaurant Bubble Tea Shop Noodle House Nightclub Tea Room Tennis Court Gift Shop Park Fried Chicken Joint Malay Restaurant
2 Brooklyn Bedford Stuyvesant 40.6834 -73.9412 2 Coffee Shop Pizza Place Café Bar Fried Chicken Joint Deli / Bodega Playground Gym Lounge Gym / Fitness Center Cocktail Bar Tiki Bar Seafood Restaurant Thrift / Vintage Store Gourmet Shop
3 Brooklyn Bensonhurst 40.605 -73.9934 1 Chinese Restaurant Bakery Bank Cantonese Restaurant Japanese Restaurant Mobile Phone Shop Bubble Tea Shop Pizza Place Supplement Shop Kids Store Gourmet Shop Coffee Shop Pharmacy Turkish Restaurant Sushi Restaurant
4 Brooklyn Bergen Beach 40.6204 -73.9068 1 Chinese Restaurant American Restaurant Donut Shop Pizza Place Bus Station Gym Peruvian Restaurant Sushi Restaurant Supermarket Deli / Bodega Italian Restaurant Field Flea Market Filipino Restaurant Film Studio

Here’s the fun part: visualizing the clusters in New York City!

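# latitude and longitude still hold the values left over from the geocoding loop above,
# so the map is centred on the last neighbourhood processed there
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)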

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NY_clustered['Latitude'], NY_clustered['Longitude'], NY_clustered['Neighbourhood'], NY_clustered['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

Let’s take a closer look

Zooming in a little more

Getting the cluster for our current restaurant

NY_clustered.loc[NY_clustered['Neighbourhood'] == "Hayes Valley"]
Borough Neighbourhood Latitude Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue 11th Most Common Venue 12th Most Common Venue 13th Most Common Venue 14th Most Common Venue 15th Most Common Venue
315 Hayes Valley, SF Hayes Valley 37.7781 -122.425 2 Clothing Store Wine Bar French Restaurant Boutique Mexican Restaurant Pizza Place Performing Arts Venue Cocktail Bar Sushi Restaurant Coffee Shop Park Bakery Optical Shop Café Juice Bar
curr_rest_cluster = NY_clustered.loc[NY_clustered['Neighbourhood'] == "Hayes Valley","Cluster Labels"].item()
print(curr_rest_cluster)
2
# Extracting all the neighbourhoods falling under the cluster 2 - same as that of our restaurant in San Francisco
Cluster0=NY_clustered[NY_clustered['Cluster Labels']==curr_rest_cluster]
Cluster0.shape
(161, 20)

The cluster has 161 rows, one of which is Hayes Valley itself, so we have 160 options for our new restaurant. In other words, 160 neighbourhoods in New York City are similar to our current locality. That gives us plenty to choose from, so let’s narrow it down with another round of clustering on these 160 locations!

NY1_grouped = NY_grouped

# Adding cluster labels to our original dataframe on which the Kmeans clustering was done
NY1_grouped = NY1_grouped.merge(neighborhoods_venues_sorted[['Borough','Neighborhood','Cluster Labels']].set_index(['Borough','Neighborhood']), left_on=['Borough','Neighborhood'],right_on=['Borough','Neighborhood'])
NY1_grouped.head()
Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant Amphitheater ... Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit Cluster Labels
0 Bronx Allerton 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
1 Bronx Bathgate 0.0 0.0 0.0 0.0 0.0 0.0 0.010000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
2 Bronx Baychester 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 Bronx Bedford Park 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
4 Bronx Belmont 0.0 0.0 0.0 0.0 0.0 0.0 0.017241 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1

5 rows × 444 columns

# Extracting data for the neighbourhoods which has the same cluster as that of our restaurant
SF_cluster=NY1_grouped[NY1_grouped['Cluster Labels']==curr_rest_cluster]
SF_cluster.shape
(161, 444)

Preparing the revised dataframe for re-clustering, to identify the neighbourhood that is most similar to our current neighbourhood

SF_grouped_clustering = SF_cluster.drop(['Borough','Neighborhood','Cluster Labels'], axis=1)  # axis must be passed by keyword in pandas 2.0+

Determining which neighbourhood is most similar to our restaurant’s neighbourhood by increasing the value of k. With many small clusters, only the neighbourhoods closest to Hayes Valley end up sharing its cluster, so using a large k in this way is an indirect way to measure the distance of the other neighbourhoods from a particular neighbourhood.
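
Re-clustering with a large k is an indirect proxy for distance. For comparison, a direct alternative (not the method used in this analysis) is to compute the Euclidean distance from the reference neighbourhood to every other one and sort. The sketch below uses made-up frequency vectors and placeholder names.

```python
import numpy as np

# Hypothetical venue-frequency profiles: row 0 is the reference neighbourhood,
# the other rows are candidates. Each row sums to 1, like the grouped data.
names = ["Reference", "A", "B", "C"]
profiles = np.array([
    [0.50, 0.30, 0.20],
    [0.40, 0.40, 0.20],
    [0.00, 0.10, 0.90],
    [0.50, 0.25, 0.25],
])

ref = profiles[0]
# Euclidean distance from each candidate row to the reference row
dist = np.linalg.norm(profiles[1:] - ref, axis=1)

# Candidates ordered from most similar to least similar
order = np.argsort(dist)
ranked = [(names[1:][i], float(dist[i])) for i in order]
for name, d in ranked:
    print(f"{name}: {d:.3f}")
```

A ranking like this gives an explicit ordering of all candidates, whereas the clustering approach only says which neighbourhoods share a cluster and can change with `random_state`.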

k=75
# Kmeans clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(SF_grouped_clustering)
# Inserting the revised clusters in the extracted dataframe
# drop() returns a new dataframe, which avoids modifying a slice of NY1_grouped in place
SF1_cluster=SF_cluster.drop('Cluster Labels',axis=1)
SF1_cluster.insert(0, 'Cluster Labels', kmeans.labels_)
SF1_cluster
Cluster Labels Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
5 38 Bronx Bronx Park South 0.0 0.0 0.0 0.0 0.0 0.0 0.035714 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.107143 0.250000
6 38 Bronx Bronx River 0.0 0.0 0.0 0.0 0.0 0.0 0.037037 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.111111 0.259259
9 50 Bronx City Island 0.0 0.0 0.0 0.0 0.0 0.0 0.037037 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000
11 45 Bronx Clason Point 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.083333 0.0 0.0 0.000000 0.000000 0.000000
13 40 Bronx Concourse 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
294 68 Staten Island Pleasant Plains 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000
297 58 Staten Island Randall Manor 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000
298 36 Staten Island Richmond Town 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.166667 0.000000 0.000000
304 43 Staten Island South Beach 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000
305 11 Staten Island St. George 0.0 0.0 0.0 0.0 0.0 0.0 0.051282 ... 0.0 0.0 0.025641 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.000000

161 rows × 444 columns
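
Since the large K value is only an indirect proxy for distance, the same idea can also be checked directly. The snippet below is a minimal sketch, not part of the original analysis: it ranks neighbourhoods by Euclidean distance between venue-frequency vectors. The helper name `rank_by_distance` and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

def rank_by_distance(freq, target, id_col="Neighborhood"):
    """Rank rows of a venue-frequency table by Euclidean distance to `target`.

    Hypothetical helper: assumes every column except `id_col` is numeric.
    """
    features = freq.drop(columns=[id_col]).to_numpy(dtype=float)
    # Frequency vector of the reference neighbourhood
    target_vec = features[(freq[id_col] == target).to_numpy()][0]
    # Distance from every row to the reference row
    dist = np.linalg.norm(features - target_vec, axis=1)
    out = pd.DataFrame({id_col: freq[id_col].to_numpy(), "Distance": dist})
    # Drop the reference itself and sort nearest first
    return out[out[id_col] != target].sort_values("Distance").reset_index(drop=True)

# Toy venue-frequency table (made-up numbers, not the Foursquare data)
toy = pd.DataFrame({
    "Neighborhood": ["Hayes Valley", "A", "B", "C"],
    "Wine Bar":     [0.05, 0.04, 0.00, 0.00],
    "Pizza Place":  [0.03, 0.03, 0.10, 0.00],
    "Zoo":          [0.00, 0.00, 0.00, 0.25],
})

ranking = rank_by_distance(toy, "Hayes Valley")
print(ranking)
```

To try this on the real table, the non-numeric 'Borough' column and the 'Cluster Labels' column would need to be dropped first, since the helper assumes everything except the neighbourhood name is a frequency.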

# Identifying the new cluster label
SF1_cluster[SF1_cluster['Neighborhood']=='Hayes Valley']
Cluster Labels Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
130 11 Hayes Valley, SF Hayes Valley 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.05 0.01 0.0 0.0 0.0 0.0 0.01 0.0 0.0

1 rows × 444 columns

# Extracting the details of the new cluster(=11)
Cluster_Label1=SF1_cluster.loc[SF1_cluster['Neighborhood']=='Hayes Valley','Cluster Labels'].item()
SF1_cluster[SF1_cluster['Cluster Labels']==Cluster_Label1]
Cluster Labels Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
21 11 Bronx Fordham 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
24 11 Bronx Kingsbridge 0.0 0.0 0.00 0.0 0.0 0.0 0.025641 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.025641 0.0 0.0
63 11 Brooklyn Brighton Beach 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
95 11 Brooklyn Homecrest 0.0 0.0 0.00 0.0 0.0 0.0 0.025000 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.025000 0.0 0.0
130 11 Hayes Valley, SF Hayes Valley 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 ... 0.0 0.05 0.010000 0.0 0.00 0.0 0.0 0.010000 0.0 0.0
145 11 Manhattan Harlem (Central) 0.0 0.0 0.03 0.0 0.0 0.0 0.010000 ... 0.0 0.01 0.010000 0.0 0.01 0.0 0.0 0.010000 0.0 0.0
154 11 Manhattan Manhattanville 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 ... 0.0 0.00 0.033333 0.0 0.00 0.0 0.0 0.033333 0.0 0.0
179 11 Queens Auburndale 0.0 0.0 0.00 0.0 0.0 0.0 0.037037 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
217 11 Queens Kew Gardens 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.020833 0.0 0.0
305 11 Staten Island St. George 0.0 0.0 0.00 0.0 0.0 0.0 0.051282 ... 0.0 0.00 0.025641 0.0 0.00 0.0 0.0 0.000000 0.0 0.0

10 rows × 444 columns

Excluding Hayes Valley itself, the 9 New York neighbourhoods above are the closest to our current restaurant's locality and are candidates for opening our new restaurant and expanding the business!

Finalizing the output

SF2_cluster=SF1_cluster[SF1_cluster['Cluster Labels']==Cluster_Label1].copy()
SF2_cluster.drop("Cluster Labels",inplace=True,axis=1)
SF2_cluster
Borough Neighborhood Adult Boutique Afghan Restaurant African Restaurant Airport Food Court Airport Lounge Airport Terminal American Restaurant Amphitheater ... Whisky Bar Wine Bar Wine Shop Winery Wings Joint Women's Store Yemeni Restaurant Yoga Studio Zoo Zoo Exhibit
21 Bronx Fordham 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
24 Bronx Kingsbridge 0.0 0.0 0.00 0.0 0.0 0.0 0.025641 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.025641 0.0 0.0
63 Brooklyn Brighton Beach 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
95 Brooklyn Homecrest 0.0 0.0 0.00 0.0 0.0 0.0 0.025000 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.025000 0.0 0.0
130 Hayes Valley, SF Hayes Valley 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 0.00 ... 0.0 0.05 0.010000 0.0 0.00 0.0 0.0 0.010000 0.0 0.0
145 Manhattan Harlem (Central) 0.0 0.0 0.03 0.0 0.0 0.0 0.010000 0.01 ... 0.0 0.01 0.010000 0.0 0.01 0.0 0.0 0.010000 0.0 0.0
154 Manhattan Manhattanville 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 0.00 ... 0.0 0.00 0.033333 0.0 0.00 0.0 0.0 0.033333 0.0 0.0
179 Queens Auburndale 0.0 0.0 0.00 0.0 0.0 0.0 0.037037 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.000000 0.0 0.0
217 Queens Kew Gardens 0.0 0.0 0.00 0.0 0.0 0.0 0.000000 0.00 ... 0.0 0.00 0.000000 0.0 0.00 0.0 0.0 0.020833 0.0 0.0
305 Staten Island St. George 0.0 0.0 0.00 0.0 0.0 0.0 0.051282 0.00 ... 0.0 0.00 0.025641 0.0 0.00 0.0 0.0 0.000000 0.0 0.0

10 rows × 443 columns

columns1 = ['Borough','Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns1.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no suffix listed for this position, so fall back to 'th'
        columns1.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Final_venues_sorted = pd.DataFrame(columns=columns1)
Final_venues_sorted['Borough']=SF2_cluster['Borough']
Final_venues_sorted['Neighborhood'] = SF2_cluster['Neighborhood']

for ind in np.arange(SF2_cluster.shape[0]):
    Final_venues_sorted.iloc[ind, 2:] = return_most_common_venues(SF2_cluster.iloc[ind, :], num_top_venues)

Final_venues_sorted
Borough Neighborhood 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue 6th Most Common Venue 7th Most Common Venue 8th Most Common Venue 9th Most Common Venue 10th Most Common Venue 11th Most Common Venue 12th Most Common Venue 13th Most Common Venue 14th Most Common Venue 15th Most Common Venue
21 Bronx Fordham Shoe Store Fast Food Restaurant Coffee Shop Sandwich Place Clothing Store Spanish Restaurant Supplement Shop Bank Pharmacy Gym / Fitness Center Pizza Place Café Deli / Bodega Miscellaneous Shop Mobile Phone Shop
24 Bronx Kingsbridge Supermarket Café Gym Pizza Place Donut Shop Mexican Restaurant Sandwich Place Spanish Restaurant Grocery Store Thrift / Vintage Store Gourmet Shop Breakfast Spot Supplement Shop Steakhouse Burger Joint
63 Brooklyn Brighton Beach Supermarket Bakery Eastern European Restaurant Health & Beauty Service Grocery Store Donut Shop Theater Mobile Phone Shop Café Flower Shop Gourmet Shop Bar Food Truck Lounge Bus Line
95 Brooklyn Homecrest Sushi Restaurant Café Pizza Place Bagel Shop Mobile Phone Shop Ice Cream Shop Market Seafood Restaurant Bank Bar Sandwich Place Gym Coffee Shop Mediterranean Restaurant Eastern European Restaurant
130 Hayes Valley, SF Hayes Valley Clothing Store Wine Bar French Restaurant Boutique Mexican Restaurant Pizza Place Performing Arts Venue Cocktail Bar Sushi Restaurant Coffee Shop Park Bakery Optical Shop Café Juice Bar
145 Manhattan Harlem (Central) Southern / Soul Food Restaurant Mobile Phone Shop Clothing Store Theater Pizza Place Burger Joint African Restaurant Cosmetics Shop Sandwich Place Café Arts & Crafts Store Mexican Restaurant Japanese Restaurant Jazz Club Deli / Bodega
154 Manhattan Manhattanville Art Gallery Fried Chicken Joint Coffee Shop Boutique Seafood Restaurant Chinese Restaurant Sandwich Place Lounge Ethiopian Restaurant Bank Public Art College Theater Park Spa Food & Drink Shop
179 Queens Auburndale Pharmacy Café Hookah Bar Sandwich Place Train Station Train Lounge Pizza Place Miscellaneous Shop Greek Restaurant Athletics & Sports Toy / Game Store Donut Shop Korean Restaurant Vietnamese Restaurant
217 Queens Kew Gardens Pizza Place Metro Station Café Coffee Shop Mediterranean Restaurant Donut Shop Cosmetics Shop Deli / Bodega Nail Salon Bagel Shop Supplement Shop Bank Train Station Gym / Fitness Center Grocery Store
305 Staten Island St. George Clothing Store Italian Restaurant Sporting Goods Shop American Restaurant Bar Bakery Museum Monument / Landmark Deli / Bodega Tapas Restaurant Pharmacy Theater Farmers Market Bus Stop Seafood Restaurant

Result: We were able to identify the 9 neighbourhoods in New York City that are closest to our current one in terms of nearby venues, making them good candidates for opening our new restaurant and expanding our business.

There are a few things that can be developed further, such as finding a better value of “k”, since the elbow method didn’t provide a clear option. We could use the average silhouette method or the gap statistic method to do so.
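
As a rough illustration of the average silhouette method mentioned above, the sketch below scores several candidate values of k and keeps the one with the highest mean silhouette. It runs on synthetic data rather than the neighbourhood table, and the helper name `best_k_by_silhouette` is hypothetical; it is one possible way to set this up, not a result from this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values, random_state=0):
    """Fit KMeans for each k and return (best k, {k: mean silhouette score})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        # Mean silhouette over all samples; higher means better-separated clusters
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic data: three well-separated blobs of 30 points each (illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print(best_k)
for k, s in scores.items():
    print(k, round(s, 3))
```

On the neighbourhood data, X would be the venue-frequency columns only, and the range of k would need to stay below the number of neighbourhoods being clustered.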

The final decision needs to be based on a lot more information, such as the availability of space, the taxes involved, and ease of transportation.