Collecting coordinate data from Wikipedia for Indian cities.
Requirements
While working on data that involves city details, the city column needs feature engineering before modelling. In most cases we go with one of the following two approaches:
- One-hot Encoding
- Target Encoding
While this works in most cases, I always felt there could be a better approach. Then I came across the idea of converting each city into its latitude and longitude coordinates.
This way the geographical proximity between cities is captured in the features.
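To make the contrast concrete, here is a minimal sketch comparing one-hot encoding with coordinate features on a toy city column. The coordinate values below are approximate and only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Mumbai", "Pune", "Delhi"]})

# One-hot: one binary column per distinct city, no notion of distance
# between cities.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Coordinates: always exactly two numeric columns, and nearby cities
# (Mumbai/Pune) get nearby feature values.
coords = {"Mumbai": (19.08, 72.88), "Pune": (18.52, 73.86), "Delhi": (28.61, 77.21)}
df["latitude"] = df["city"].map(lambda c: coords[c][0])
df["longitude"] = df["city"].map(lambda c: coords[c][1])

print(one_hot.shape)  # grows with the number of distinct cities
print(df[["latitude", "longitude"]].shape)  # stays at two columns
```

With 500+ cities, one-hot would mean 500+ sparse columns, while the coordinate representation stays at two dense columns regardless of how many cities appear.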
How to collect this data?
Although there are some datasets already available, I couldn't find any with coordinates for 500+ Indian cities.
Step 1: Selecting the target cities
I found a Britannica page listing the major cities of India, so I began collecting the city names from there and storing them in a list.
# list of target cities
cities_list
Step 2: Scraping the coordinate data
I found that Wikipedia has coordinate data for almost all geographical locations, so I decided to go ahead with it.
# Reference : https://stackoverflow.com/questions/65162011/how-to-get-coordinates-from-a-wikipedia-page-using-python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_coordinates(city_name):
    """
    Given a city name, return a (latitude, longitude) tuple of strings,
    or a "not found" message string.
    """
    req = requests.get(f"https://en.wikipedia.org/wiki/{city_name}")
    if req:  # truthy for successful responses
        soup = BeautifulSoup(req.text, "html.parser")
        latitude = soup.find("span", {"class": "latitude"})
        longitude = soup.find("span", {"class": "longitude"})
        if latitude is not None and longitude is not None:
            return latitude.text, longitude.text
    # reaching here usually means the city name did not resolve
    # to the right Wikipedia page
    return f"Details not found for {city_name}"
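Note that the latitude/longitude spans on English Wikipedia hold DMS-formatted strings (e.g. "19°04′34″N"), not decimal floats. For modelling, those eventually need converting to decimal degrees. A sketch of that conversion, assuming the degree/minute/second symbols used on English Wikipedia:

```python
import re


def dms_to_decimal(dms: str) -> float:
    """Convert a Wikipedia-style DMS string like '19°04′34″N' (minutes and
    seconds optional) to signed decimal degrees."""
    match = re.match(r"(\d+)°(?:(\d+)′)?(?:(\d+)″)?([NSEW])", dms)
    if match is None:
        raise ValueError(f"Unrecognised DMS string: {dms!r}")
    deg, minutes, seconds, hemisphere = match.groups()
    value = int(deg) + int(minutes or 0) / 60 + int(seconds or 0) / 3600
    # south and west hemispheres are negative in decimal notation
    return -value if hemisphere in "SW" else value


print(dms_to_decimal("19°04′34″N"))  # ≈ 19.0761
print(dms_to_decimal("72°52′39″E"))  # ≈ 72.8775
```

The exact string format can vary slightly between pages, so treat the regex as a starting point rather than a guarantee.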
latitude_list = []
longitude_list = []
found_cities = []
not_found_cities = []
for city in cities_list:
    result = get_coordinates(city_name=city)
    if isinstance(result, str):
        not_found_cities.append(city)
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_cities.append(city)
    time.sleep(0.5)  # be polite and avoid hammering Wikipedia
    print(city)
As we all know, real-world data is messy: 81 cities failed with the above method.
len(not_found_cities)
# Output: 81
# Saving the data to csv.
df = pd.DataFrame(
{"cities": found_cities, "longitude": longitude_list, "latitude": latitude_list}
)
df.to_csv("cities_srapped.csv")
Step 3: Handling the messy data
After tinkering for some time, I found that Wikipedia has a search API as well.
url_cities = []
URL = "https://en.wikipedia.org/w/api.php"
for city in not_found_cities:
    S = requests.Session()
    PARAMS = {
        "action": "opensearch",
        "namespace": "0",
        "search": f"{city} India",  # append "India" to disambiguate the search
        "limit": "5",
        "format": "json",
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    time.sleep(0.5)
    for info in DATA:
        for i in info:
            if "http" in i and city in i:
                print(city, i)
                url_cities.append((city, i))
cleaned_url_city_pair = []
for c in url_cities:
    condition_india = c[1].endswith("India") or c[1].endswith("India)")
    condition_same_city = c[0] == c[1].split("/")[-1].split(",")[0]
    if condition_india and condition_same_city:
        print(c)
        cleaned_url_city_pair.append(c)
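The two filtering conditions can be checked offline on sample pairs. The URLs below are illustrative, not taken from the actual run:

```python
def is_clean_pair(city: str, url: str) -> bool:
    """The same two conditions used in the cleaning loop above."""
    condition_india = url.endswith("India") or url.endswith("India)")
    condition_same_city = city == url.split("/")[-1].split(",")[0]
    return condition_india and condition_same_city


# kept: URL ends with "India" and its last path segment starts with the city
print(is_clean_pair("Tirupati", "https://en.wikipedia.org/wiki/Tirupati,_India"))  # True
# dropped: a same-named place outside India
print(is_clean_pair("Salem", "https://en.wikipedia.org/wiki/Salem,_Massachusetts"))  # False
```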
url_cities contained duplicate values as well as similarly spelled wiki links, so I cleaned those out and stored the result in cleaned_url_city_pair.
Now that we have each city name and its wiki URL, I wrote a function to extract the coordinates given the city's wiki URL.
def get_coordinates_by_url(url):
    """Same as get_coordinates, but starting from a full wiki URL."""
    req = requests.get(url)
    if req:
        soup = BeautifulSoup(req.text, "html.parser")
        latitude = soup.find("span", {"class": "latitude"})
        longitude = soup.find("span", {"class": "longitude"})
        if latitude is not None and longitude is not None:
            return latitude.text, longitude.text
    # reaching here usually means the URL did not resolve to the right page
    return f"Details not found for {url}"
latitude_list = []
longitude_list = []
not_found_url_cities = [] # Url not found properly
found_url_cities = []
for city, url in cleaned_url_city_pair:  # (city, url) pairs
    result = get_coordinates_by_url(url)
    if isinstance(result, str):
        not_found_url_cities.append(city)
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_url_cities.append(city)
    time.sleep(0.5)
    print(city)
# Saving the df
df = pd.DataFrame(
{"cities": found_url_cities, "longitude": longitude_list, "latitude": latitude_list}
)
df.to_csv("cities_url.csv")
To my surprise, not_found_url_cities was not empty: it contained 6 cities.
Step 4: Messy real-world data never disappoints.
manual_cities = set(not_found_url_cities).union(set(not_found_cities) - set(found_url_cities))
len(manual_cities)
# Output: 54
So there are 54 cities that failed both of the above methods. I could either leave these cities out, or channel a song like "NEVER SAY NEVER" and find the wiki URL for each one by googling it.
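The loop below assumes a dict called manual_cities_urls, built by hand from those Google searches, mapping each city name to its Wikipedia URL. The entries here are hypothetical and only illustrate the shape:

```python
# Hypothetical entries showing the assumed shape of manual_cities_urls;
# the real dict was filled in by hand, one googled city at a time.
manual_cities_urls = {
    "CityA": "https://en.wikipedia.org/wiki/CityA,_India",
    "CityB": "https://en.wikipedia.org/wiki/CityB_(district)",
}
```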
Here we go
latitude_list = []
longitude_list = []
not_found_manual_cities = []  # URL did not resolve properly
found_manual_cities = []
for city in manual_cities_urls:
    # manual_cities_urls maps city name -> manually found wiki URL
    result = get_coordinates_by_url(manual_cities_urls[city])
    if isinstance(result, str):
        not_found_manual_cities.append(city)
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_manual_cities.append(city)
    time.sleep(0.5)
    print(city)
df = pd.DataFrame(
{
"cities": found_manual_cities,
"longitude": longitude_list,
"latitude": latitude_list,
}
)
df.to_csv("cities_manual.csv")
Now we have three different CSV files. It's time to merge them.
df1 = pd.read_csv("cities_manual.csv")
df2 = pd.read_csv("cities_srapped.csv")
df3 = pd.read_csv("cities_url.csv")
all_cities = pd.concat([df1, df2, df3])
all_cities.drop(columns=["Unnamed: 0"], inplace=True)
all_cities.to_csv("All_cities_df.csv", index=False)
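Since the same city can in principle slip into more than one of the three CSVs, a defensive dedupe on the city column is cheap insurance before publishing. A sketch on in-memory frames standing in for the three files (rows are hypothetical):

```python
import pandas as pd

# Stand-ins for the three CSVs; CityA deliberately appears twice.
df1 = pd.DataFrame({"cities": ["CityA"], "longitude": ["77°13′E"], "latitude": ["28°37′N"]})
df2 = pd.DataFrame({"cities": ["CityB"], "longitude": ["72°52′E"], "latitude": ["19°04′N"]})
df3 = pd.DataFrame({"cities": ["CityA"], "longitude": ["77°13′E"], "latitude": ["28°37′N"]})

all_cities = (
    pd.concat([df1, df2, df3], ignore_index=True)
    .drop_duplicates(subset="cities")  # keep the first row per city
    .reset_index(drop=True)
)
print(len(all_cities))  # 2
```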
If you want to access the dataset directly you can head to Kaggle.
From now on, whenever city data is available we can incorporate it using coordinates.
The following are industries where city data shows up frequently:
- E-commerce
- Banking
- Transportation (supply chain management)