Collecting coordinate data from Wikipedia for Indian cities

Tushar Tiwari · Published in Geek Culture · Feb 23, 2023



Requirements

While working with data that involves city details, the city column has to go through feature engineering before modelling.
In most cases we go with one of the following two approaches:

  1. One-hot Encoding
  2. Target Encoding

While this works in most cases, I always felt there could be a better approach. Then I came across the idea of converting each city into its respective latitude and longitude coordinates.

This way the geographical proximity between cities is captured in the features.
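As a quick illustration (a minimal sketch with made-up sample rows; the coordinates are only approximate), one-hot encoding treats every city as an unrelated category, while coordinates keep nearby cities numerically close:

import pandas as pd

# Approximate coordinates for three example cities (illustrative only).
sample = pd.DataFrame(
    {
        "city": ["Mumbai", "Pune", "Kolkata"],
        "latitude": [19.08, 18.52, 22.57],
        "longitude": [72.88, 73.86, 88.36],
    }
)

one_hot = pd.get_dummies(sample["city"])    # 3 orthogonal columns, no notion of distance
coords = sample[["latitude", "longitude"]]  # 2 columns where Mumbai and Pune stay close
print(one_hot)
print(coords)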

How to collect this data?

Although there are some datasets already available for this, I couldn't find any that covers 500+ Indian cities with coordinates.

Step 1: Selecting the target cities

I found a Britannica page listing the major cities in India, so I collected the city names from there and stored them in a list.

cities_list
# list of target cities
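The post doesn't show how the list itself was built; a minimal sketch (assuming the names were simply collected from the Britannica page into a plain Python list, truncated here to a few entries) would look like this:

# A few of the 500+ names collected from the Britannica page (truncated).
cities_list = [
    "Mumbai",
    "Delhi",
    "Bengaluru",
    "Hyderabad",
    "Chennai",
    "Kolkata",
    # ... remaining city names
]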

Step 2: Scraping the coordinates data

I found that Wikipedia has coordinate data for almost all geographical locations, so I decided to go ahead with it.

# Reference: https://stackoverflow.com/questions/65162011/how-to-get-coordinates-from-a-wikipedia-page-using-python
import time

import requests
import pandas as pd
from bs4 import BeautifulSoup


def get_coordinates(city_name):
    """
    Given the city name, return either a tuple of (latitude, longitude)
    or a "not found" message string.
    """
    flag = None
    req = requests.get(f"https://en.wikipedia.org/wiki/{city_name}")
    if req:
        soup = BeautifulSoup(req.text, "html.parser")
        latitude = soup.find("span", {"class": "latitude"})
        longitude = soup.find("span", {"class": "longitude"})
        if latitude is not None and longitude is not None:
            # the spans are missing when the city name doesn't resolve to a proper page
            return latitude.text, longitude.text

    flag = f"Details not found for {city_name}"
    return flag
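A quick sanity check of the helper (the returned strings come from Wikipedia's coordinate markup and are typically degree/minute strings such as "19°04′N", though the exact format varies by article):

print(get_coordinates("Mumbai"))        # e.g. ("19°04′N", "72°52′E"), approximate, format varies
print(get_coordinates("NotACityXYZ"))   # "Details not found for NotACityXYZ" (page does not exist)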

latitude_list = []
longitude_list = []
found_cities = []
not_found_cities = []
for city in cities_list:
    result = get_coordinates(city_name=city)
    if isinstance(result, str):
        not_found_cities.append(city)
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_cities.append(city)
    time.sleep(0.5)  # sleep so we don't overload Wikipedia
    print(city)

As we all know, real-world data is messy: 81 cities failed with the above method.

len(not_found_cities)
#Output: 81

# Saving the data to csv.
df = pd.DataFrame(
    {"cities": found_cities, "longitude": longitude_list, "latitude": latitude_list}
)
df.to_csv("cities_srapped.csv")

Step 3: Handling the messy data

Photo by PAN XIAOZHEN on Unsplash

After tinkering for some time, I found that Wikipedia has a search API as well.

url_cities = []
URL = "https://en.wikipedia.org/w/api.php"
for city in not_found_cities:
    S = requests.Session()
    PARAMS = {
        "action": "opensearch",
        "namespace": "0",
        "search": f"{city} India",  # added search term ending with India
        "limit": "5",
        "format": "json",
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    time.sleep(0.5)
    for info in DATA:
        for i in info:
            if "http" in i and city in i:
                print(city, i)
                url_cities.append((city, i))
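For reference, the opensearch response is a 4-element list: the search term, the matching page titles, their descriptions, and the page URLs. The nested loop above simply walks everything and keeps the strings that look like URLs containing the city name. A one-off call makes the shape visible (the city name here is only illustrative):

# Inspect the response for a single search term.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "opensearch",
        "namespace": "0",
        "search": "Ranchi India",
        "limit": "5",
        "format": "json",
    },
).json()
print(resp[0])  # the search term
print(resp[1])  # matching page titles
print(resp[3])  # page URLs, these are what the loop above picks up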


cleaned_url_city_pair = []
for c in url_cities:
    condition_india = c[1].endswith("India") or c[1].endswith("India)")
    condition_same_city = c[0] == c[1].split("/")[-1].split(",")[0]
    if condition_india and condition_same_city:
        print(c)
        cleaned_url_city_pair.append(c)

url_cities contained duplicate values as well as similarly spelled wiki links, so I cleaned those out and stored the result in cleaned_url_city_pair.
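To make the two conditions concrete, here is a tiny illustration with made-up (city, URL) pairs rather than actual scraper output; only the first pair survives both checks:

examples = [
    ("Kanchipuram", "https://en.wikipedia.org/wiki/Kanchipuram,_India"),    # kept
    ("Kanchipuram", "https://en.wikipedia.org/wiki/Kanchipuram_district"),  # dropped: URL doesn't end with "India"
    ("Alwar", "https://en.wikipedia.org/wiki/Alwar_Fort,_India"),           # dropped: slug city name differs
]
for c in examples:
    condition_india = c[1].endswith("India") or c[1].endswith("India)")
    condition_same_city = c[0] == c[1].split("/")[-1].split(",")[0]
    print(c[0], condition_india and condition_same_city)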

Now that we have each city name and its wiki URL, I wrote a function to extract the coordinates given the wiki URL of a city.

def get_coordinates_by_url(url):
    """
    Given a wiki URL, return either a tuple of (latitude, longitude)
    or a "not found" message string.
    """
    flag = None
    req = requests.get(f"{url}")
    if req:
        soup = BeautifulSoup(req.text, "html.parser")
        latitude = soup.find("span", {"class": "latitude"})
        longitude = soup.find("span", {"class": "longitude"})
        if latitude is not None and longitude is not None:
            # the spans are missing when the page doesn't expose coordinates
            return latitude.text, longitude.text

    flag = f"Details not found for {url}"
    return flag

latitude_list = []
longitude_list = []
not_found_url_cities = []  # coordinates not found from the URL
found_url_cities = []
for url in cleaned_url_city_pair:
    # cleaned_url_city_pair has (city, url) pairs
    result = get_coordinates_by_url(url[1])
    if isinstance(result, str):
        not_found_url_cities.append(url[0])
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_url_cities.append(url[0])
    time.sleep(0.5)
    print(url[0])

# Saving the df
df = pd.DataFrame(
    {"cities": found_url_cities, "longitude": longitude_list, "latitude": latitude_list}
)
df.to_csv("cities_url.csv")

To my surprise, not_found_url_cities was not empty; it contained 6 cities.

Step 4: Messy real world data never disappoints.

manual_cities = set(not_found_url_cities).union(set(not_found_cities) - set(found_url_cities))
len(manual_cities)
# 54

So there are 54 cities which failed both of the above methods. I could either leave these cities out, or put on a song like “NEVER SAY NEVER” and find the wiki URL for each by googling it.

Here we go
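The post doesn't show manual_cities_urls explicitly; from the loop below it is a dict mapping each remaining city name to the Wikipedia URL found by googling it. A sketch of its shape (entries are placeholders, not the actual 54 cities):

# city name -> manually googled Wikipedia URL for the 54 remaining cities
manual_cities_urls = {
    # "ExampleCity": "https://en.wikipedia.org/wiki/ExampleCity,_ExampleState",
    # ...
}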

latitude_list = []
longitude_list = []
not_found_manual_cities = []  # coordinates not found from the URL
found_manual_cities = []
for city in manual_cities_urls:
    # manual_cities_urls maps city -> manually found wiki URL
    result = get_coordinates_by_url(manual_cities_urls[city])
    if isinstance(result, str):
        not_found_manual_cities.append(city)
    else:
        latitude_list.append(result[0])
        longitude_list.append(result[1])
        found_manual_cities.append(city)
    time.sleep(0.5)
    print(city)

df = pd.DataFrame(
    {
        "cities": found_manual_cities,
        "longitude": longitude_list,
        "latitude": latitude_list,
    }
)
df.to_csv("cities_manual.csv")

Now we have 3 different CSV files. It's time to merge them.

df1 = pd.read_csv("cities_manual.csv")
df2 = pd.read_csv("cities_srapped.csv")
df3 = pd.read_csv("cities_url.csv")
all_cities = pd.concat([df1, df2, df3])

all_cities.drop(columns=["Unnamed: 0"], inplace=True)
all_cities.to_csv("All_cities_df.csv", index=False)

If you want to access the dataset directly, you can head to Kaggle.


From now on, whenever city data is available, we can incorporate it using coordinates.
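One practical note (an assumption on my part, based on how Wikipedia renders coordinates): the scraped values are degree/minute strings such as "19°04′N" rather than floats, so before feeding them to a model you would likely convert them to decimal degrees and join them onto your city column. A minimal sketch:

import re

import pandas as pd

def dms_to_decimal(dms):
    """Convert a string like '19°04′N' or '22°34′11″N' to decimal degrees."""
    parts = [float(x) for x in re.findall(r"[\d.]+", dms)]
    minutes = parts[1] if len(parts) > 1 else 0.0
    seconds = parts[2] if len(parts) > 2 else 0.0
    decimal = parts[0] + minutes / 60 + seconds / 3600
    return -decimal if dms.strip()[-1] in ("S", "W") else decimal

coords = pd.read_csv("All_cities_df.csv")
coords["latitude"] = coords["latitude"].apply(dms_to_decimal)
coords["longitude"] = coords["longitude"].apply(dms_to_decimal)

# Hypothetical usage: `train` is your own modelling dataframe with a "city" column.
# train = train.merge(coords, left_on="city", right_on="cities", how="left")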

The following are industries where city data shows up frequently:

  1. E-commerce
  2. Banking
  3. Transportation (supply chain management)

References:

  1. https://www.britannica.com/topic/list-of-cities-and-towns-in-India-2033033
  2. https://stackoverflow.com/questions/65162011/how-to-get-coordinates-from-a-wikipedia-page-using-python

Contact: Linkedin ||| Email
