Saturday, October 15, 2016

A network of physicists at IIT Madras

I have started finding networks interesting, especially for the insights they can provide into the underlying system. Earlier, I worked on building a network of universities based on co-authorship of publications. Studying such networks and their evolution can be useful. For example, if an ongoing multi-university collaboration is successful without the knowledge and support of the host universities, such an analysis can be a way to lobby for official support.

Along similar lines, I created a new network of physicists at the Department of Physics at the Indian Institute of Technology Madras (my alma mater). The department recently revamped its website, specifically the Recent Publications page, which is updated with publications by the faculty. As you can see from the table, each row/paper contains a list of authors. By collecting these lists, we can build a network that shows who collaborates with whom and, without any prior knowledge, guess which labs collaborate with which others, and so on.

I used the following Python code to extract the lists and construct the network. The network can be seen on my website; it was easier to display it there than on this blog post. Hover over the individual nodes in the graph to see the name of the faculty/student member each node represents.

import json

import pandas
import networkx as nx
from networkx.readwrite import json_graph


url_template = 'https://physics.iitm.ac.in/researchinfo?page={}'
authorlist = []


for i in range(8):
    # we can pass a url as the first argument to pandas.read_html
    # and it returns a list of data frames
    df_list = pandas.read_html(url_template.format(i),
                               header=0,
                               index_col=0
    )
    df = df_list[0]

    # column containing author names needs to be cleaned
    df.Authors = df.Authors.str.lower()
    df.Authors = df.Authors.str.strip()
    df.Authors = df.Authors.str.replace(r'\*', ' ')  # escape *: the pattern is treated as a regex
    df.Authors = df.Authors.str.replace('and ', ',')
    df.Authors = df.Authors.str.replace('&', ',')

    # Split column containing authors on ","
    # split is a data frame i.e 2D array
    split = df['Authors'].str.split(u',', expand=True)
    split.columns = ['Authors_split_{0}'.format(j)
                     for j in range(len(split.columns))]
   
    # strip author names of whitespaces
    for column in split:
        split[column] = split[column].str.strip()

    # each row contains the authors of one paper
    # a row might contain NaNs, which is why we use dropna
    # (use a fresh variable j so the page counter i isn't clobbered)
    for j in range(len(split)):
        authorlist.append(list(split.iloc[j].dropna()))


G = nx.Graph()

# link each author to the other authors on each paper
# (avoid shadowing the builtin `list`)
for authors in authorlist:
    for pos, node1 in enumerate(authors):
        # start at pos + 1 so an author isn't linked to themselves
        for node2 in authors[pos + 1:]:
            # there might be empty strings or whitespace in the author list
            if node1.strip() != u'' and node2.strip() != u'':
                G.add_edge(node1, node2)

# label each node with the author's name
for n in G:
    G.node[n]['name'] = n  # G.nodes[n] in networkx >= 2.0

# draw the graph with networkx's drawing helpers (these need matplotlib installed)
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=100, node_color='blue')
nx.draw_networkx_edges(G, pos, edge_color='green')
nx.draw_networkx_labels(G, pos, font_color='red')

# convert the Graph object into node-link JSON
# the JSON file is then rendered using D3
d = json_graph.node_link_data(G)
json.dump(d, open('force.json', 'w'))

The code is highlighted and formatted using hilite.me.
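Once the graph is built, networkx also makes it easy to ask simple questions of it, for example which author has the most distinct co-authors. A minimal sketch on a toy graph (the author names here are placeholders, not real department members):

```python
import networkx as nx

# toy co-authorship graph; the names are placeholders
G = nx.Graph()
G.add_edges_from([('asha', 'bala'), ('asha', 'chitra'),
                  ('asha', 'deepak'), ('bala', 'chitra')])

# in a co-authorship graph, an author's degree is their
# number of distinct co-authors
degrees = dict(G.degree())
most_connected = max(degrees, key=degrees.get)
print(most_connected, degrees[most_connected])  # asha 3
```

The same one-liner on the real graph picks out the department's most prolific collaborator.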

Sunday, September 11, 2016

Mapping out the train routes in India

I had nothing better to do on a Sunday morning, so I made a map. If you have been following my blog, you will know that I am trying to make sense of the Indian Railways: why trains run late, and whether we can learn anything about the cause of the delays from patterns in them.

Towards that goal, I wrote two blog posts that described how and where I was collecting my data, and took a first look at the data I was gathering. Even from that preliminary look, it can be seen that there are specific routes/stations causing delays along a train's route. And the delays appeared on multiple runs at the same station, meaning it wasn't simply a one-time thing.

Moving on, another way to look at the problem is to understand how crowded the railway lines are. By mapping all train movement in India, we can see how crowded specific routes are and whether they're crowded at specific times of day. By adding delay information from multiple trains to the map, we will also be able to say with good certainty which routes are crowded and lead to large delays.

I took a small step in that direction today by mapping out the routes of a few trains. Below you will find one such map/image, where the red lines correspond to the routes of a few trains. Note that this isn't the complete roster of trains run by the Indian Railways; it is a very small subset. But the process by which such a map is made is scalable, and the small subset was enough for this proof of concept.



While this image isn't interactive, running the code creates an interactive map, which displays station codes when the user hovers over a route. I modified an available plotly example that produced a USA flight-paths map. The code can be found below.



import glob
import plotly.plotly as py
import pandas as pd


files = glob.glob('routes/*')

df_stations = pd.DataFrame(columns=['station', 'lat', 'long'])
df_station_paths = pd.DataFrame(columns=['start_lon', 'start_lat', 'end_lon', 'end_lat'])

for file in files:
    df_stations_temp = pd.read_csv(file,
                                   names=['station', 'lat', 'long'],
                                   na_values=[0.0])
    df_stations_temp = df_stations_temp.dropna(axis=0, how='any')
    df_station_paths_temp = pd.DataFrame([[df_stations_temp.iloc[i]['long'],
                                          df_stations_temp.iloc[i]['lat'],
                                          df_stations_temp.iloc[i+1]['long'],
                                          df_stations_temp.iloc[i+1]['lat']] 
                                         for i in range(len(df_stations_temp)-1)],
                                        columns=['start_lon', 'start_lat', 'end_lon', 'end_lat'])
    df_stations = pd.concat([df_stations, df_stations_temp], ignore_index=True)
    df_station_paths = pd.concat([df_station_paths, df_station_paths_temp], ignore_index=True)

stations = [ dict(
        type = 'scattergeo',
        # locationmode only accepts 'ISO-3', 'USA-states' or 'country names',
        # so the USA example's setting is dropped here
        lon = df_stations['long'],
        lat = df_stations['lat'],
        hoverinfo = 'text',
        text = df_stations['station'],
        mode = 'markers',
        marker = dict( 
            size=2, 
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]
        
station_paths = []
for i in range( len( df_station_paths ) ):
    station_paths.append(
        dict(
            type = 'scattergeo',
            lon = [ df_station_paths['start_lon'][i], df_station_paths['end_lon'][i] ],
            lat = [ df_station_paths['start_lat'][i], df_station_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            ),
            opacity = 1.,
        )
    )
    
layout = dict(
        showlegend = False, 
        height=1000,
        geo = dict(
            scope='asia',  # 'India' is not a valid scope; 'asia' is the closest
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )

fig = dict( data=station_paths+stations, layout=layout )
py.iplot( fig, filename='d3-station-paths' )


To briefly go over the code: the route files for individual trains were stored in `routes/train_number.csv`, and each file contained three columns: station code along the route, lat, long. Note that the locations and station codes along the route of a specific train were acquired using RailwayAPI. From each file, the code above first creates a pandas DataFrame, which is then manipulated to create a new DataFrame containing the train's path/route. These two DataFrames are finally restructured and passed on to plotly, which creates the map above.
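As an aside, the pairwise segment construction above (station i to station i + 1) can also be written with zip instead of an index loop; a small sketch on made-up coordinates:

```python
# toy route: (lat, long) of three consecutive stations (made-up values)
route = [(12.97, 77.59), (13.08, 80.27), (17.38, 78.48)]

# zipping the route against itself shifted by one gives the
# consecutive (start, end) pairs that form the line segments
segments = list(zip(route, route[1:]))
print(len(segments))  # 2 segments for 3 stations
```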

The map is far from perfect. For starters, like I mentioned earlier, it is but a small subset of all trains available. Secondly, the lat/long data seems to be faulty, because there seem to be stray lines that deviate from a train's actual route in the map.

I am trying to find a better source of information than RailwayAPI, a list of all trains run by the Indian Railways, ideally an official source from the Indian Railways itself, and an easier way to make such a map and make it interactive. If there's something I can do to make the map/processing better, point it out to me! I'd love to hear comments/feedback.

Until the next time ...

Sunday, August 21, 2016

A tale of two trains : The Indian Railways

Last week, I started collecting the running status of a few (<10) trains every day. I wrote a blog post last week about how I was collecting the data, if you want to know more. Now, let's look at what I've collected so far.

(Open the following images in a new page to take a better look at which stations are the most problematic and to understand the general trends better.)

Train 18029 - runs from Lokmanyatilak (Mumbai) to Shalimar. This train is mostly representative of what happens with the rest of the trains discussed below. There are stations en route where the train makes up for lost time, and then it loses any gains made. But, for the most part, I guess the delays are acceptable, given that they're within an hour of the expected arrival time.


Train 12809 - runs from Mumbai CST to Howrah JN. This train was a little surprising because it's different from the rest of the lot. The train almost always makes up for delays at the start of the route. There are a few places where there's a drastic reduction in delay, but the gains are offset a few stations later (thrice)!




Train 12322 - runs from Mumbai CST to Howrah JN. This train displays two interesting trends. The first is that even though there are stations en route where the train makes up for lost time (twice), it gets delayed again almost immediately. The second is that beyond a certain point en route, the delay persists, and in 2/4 cases, the train can't make up for lost time.



Train 12622 - runs from New Delhi to Chennai Central. Can't complain about this train.



Train 12616 - runs from Delhi to Chennai Central. The interesting thing to note here is that there are points en route where the train makes up for lost time - but it gets delayed again almost immediately, negating any reduction in delay.


Train 12424 - runs from New Delhi to Dibrugarh Town via Guwahati. This train is just sad. At no point en route does it show any prospect of making up lost time, if it's late.




Train 14056 - runs from Delhi to Dibrugarh via Guwahati. The running status of this train looks a little weird, doesn't it? After a certain point, the delays become very predictable instead of random. That is because I was asking for the running status of the train at the wrong time - while the train was still en route. Of course, if I ask for the running status while a train is en route, all I will get are estimated delays at future stations. That is the reason behind the long horizontal lines followed by dips.


Train 15910 - runs from Lalgarh JN to Dibrugarh via Guwahati. The running status of this train shows the same behavior as the earlier one (14056), i.e. asking for the running status while the train is still en route WILL give me faulty estimates of delay beyond the current position of the train. And of course, it's in the Indian Railways' best interests to estimate no delay instead of providing more accurate estimates.

That's all for now, folks. I know, we didn't learn too much about why the delays are being caused or which routes lead to the most delay, but we'll get there. I think. I'll try. I'll post the code I used to analyze the data and generate the plots tomorrow. If you can glean anything more from the plots above, or have any other comments you'd like to pass on to me, I'm all ears.

Sunday, August 14, 2016

Dude, Where's my Train? - The Indian Railways.

I have a friend who was traveling from New Delhi to Guwahati and, due to certain constraints, had to take train number 12502. She had made further plans for traveling onward from Guwahati based on the assumption that she would reach Guwahati at the expected arrival time. You all know where this is going. The train was late and she had to change her travel plans. We have all either known someone who went through this or gone through it ourselves. Usually, trains run by the Indian Railways are not more than an hour late. There are trains that run perfectly on time too. And then there are trains that are multiple hours late, sometimes even > 6! Which I don't think is acceptable. And because I had nothing better to do on a Sunday, I set about doing something about it.

If you guys have read a few of my earlier blog posts, you know where this is going. I'm going to write some code that will help me automate something. Or get some data. Or make a plot or a map. This too will be more of the same. In the context of trains run by the Indian Railways.

Let's start with something simple - trains that are cancelled. Everyday, the Indian Railways announces Fully/Partially Cancelled Trains for that day. It doesn't state a reason as to why they were cancelled. And I never really had a reason to check this list before. Until now.

Let me first state what I want to do. I want to see if there are trains that have been cancelled everyday. And then look for reasons as to why they might be. Think about it. Why would the Indian Railways even list a train if it's being cancelled every day? Are there costs involved with taking a train off of service or rotation? Are there costs involved with maintaining a train, even though it's being cancelled every day?

To find out, let's write some code. The other way to find out would be to manually make a list of trains cancelled every day from this website, but I'm wayyy too lazy for that. In order to automate the task of getting each day's list of cancelled trains, I used the RailwayAPI website. These people are not affiliated with the Indian Railways, but they seem to offer most of the information I need. I've cross-checked that the list RailwayAPI returns matches the one Indian Railways displays, so the data from RailwayAPI seems to be correct. With that, let's move on to some code.

from api_key import API_KEY as MY_API_KEY

import urllib
import json

from datetime import datetime

today = datetime.today()

day = today.day
month = today.month
year = today.year

URL_TEMPLATE = "http://api.railwayapi.com/cancelled/date/{day}-{month}-{year}/apikey/{APIKEY}/"
url = URL_TEMPLATE.format(day=day, month=month, year=year, APIKEY=MY_API_KEY)

response = urllib.urlopen(url)
json_response = response.read()
response = json.loads(json_response)

filename = "{day}-{month}-{year}.can".format(day=day, month=month, year=year)

with open(filename, 'w') as f:
    for train in response['trains']:
        f.write("{} \n".format(train['train']['number']))

Let me explain what is happening in the above code. The API_KEY in the first line is used to let me access their data. The same way Facebook recognizes you using a username and password, the RailwayAPI people recognize me using the API_KEY. I had to register with them to get my API_KEY, and I can't display it publicly. You too can get one by signing up with them. This is why I'm importing it as a variable defined in another file.

Moving on. The URL_TEMPLATE is what I need to query to get information on cancelled trains. You can see that there are placeholders for the date in the URL, which are filled in in the next step. The request is made using the urllib library available in the Python standard library. The next few steps simply make the request, get a response, and convert the response into a meaningful format.

The last few steps involve writing the response to a file. The complete response contains information about where the train departs from, what its destination is, what its number is, and so on. I don't need all of that information; I just want the train number, which is what I'm writing to the file in the last line. You will need to look at the response yourself to understand its structure.
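To illustrate, this is the kind of nesting the script above assumes; the exact keys come from RailwayAPI's schema, so treat this as a hypothetical example rather than the real response:

```python
# a made-up response with the nesting the script expects
response = {
    'trains': [
        {'train': {'number': '12502'}},
        {'train': {'number': '18029'}},
    ]
}

# this mirrors the loop in the script: drill down to each train number
numbers = [entry['train']['number'] for entry in response['trains']]
print(numbers)  # ['12502', '18029']
```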

All I had to do now was run this code everyday. But because I'm too lazy to manually run this code everyday, I automated that too. Cron comes to the rescue. I mentioned cron in one of my earlier posts. It's a way to run tasks/jobs periodically on Linux/OSX. I simply had to add

01 00  * * * /usr/bin/python /path/to/script/query_for_cancelled.py

to my list of automated cron tasks/jobs, that can be accessed using

crontab -e

Again, all I'm doing is calling the script query_for_cancelled.py using Python (/usr/bin/python is the full path to the executable) at 00:01 AM every day. There ends the code.

Now for a preliminary look at the results. I queried for the trains cancelled on August 03, 04, 05 and 06, and then collected the trains that were cancelled on all of those days, a few of which are: 18235, 18236, 19943, 19944, 22183, 22184, 22185, 22186, 24041, 24042. Notice that there are 5 pairs in total, where the two numbers in a pair differ only in the last digit. If you don't know already, the change in the last digit differentiates the train from A to B from the one from B to A. All of those trains are a little weird. Especially the last pair, 24041 and 24042: it's a train that is supposed to travel 25 km. Umm, what?
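The pairing observation above can be checked mechanically. A small sketch, using the heuristic that the two directions of a service are consecutive train numbers:

```python
cancelled = ['18235', '18236', '19943', '19944', '22183',
             '22184', '22185', '22186', '24041', '24042']

# pair each number with number + 1 when both were cancelled;
# the two directions of a service differ only in the last digit
pairs = []
seen = set()
for number in sorted(cancelled, key=int):
    if number in seen:
        continue
    partner = str(int(number) + 1)
    if partner in cancelled:
        pairs.append((number, partner))
        seen.add(partner)

print(len(pairs))  # 5 pairs
```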

Wednesday, July 27, 2016

Playing around with errors in Python - NameErrors

Let's start with NameError, one of the more common errors that a newcomer to Python will come across. It is reported when Python can't find a local or global variable in the code. One reason it might pop up is that a variable is being referred to outside of its namespace. To give you an example:

a = 10

def test_f():
    a = 20
    print a

test_f()
print a

Let's walk through the code. After defining the variable a and the function test_f, you might naively expect the test_f() function call to change the value of a to 20 and print 20. You would then expect the print statement after the function call to also print 20, because you expect the function call to have changed the value of a. But if you run the code yourself, you'll notice that the final print statement prints 10. This is where namespaces come into the picture: the assignment inside test_f creates a new local variable a that shadows the global one.
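If you genuinely want the function to modify the module-level a rather than create a local one, Python's global keyword does exactly that; a minimal sketch:

```python
a = 10

def test_f():
    global a  # bind a to the module-level name, not a new local
    a = 20

test_f()
print(a)  # 20
```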

Now, let's try this instead

def test_f():
    b = 20
    print b

test_f()
print b

The call to the test_f function will set and print the variable b, but the print statement afterwards will raise a NameError because, outside of the test_f function, the variable b isn't defined.

Let's look at another example, this time in the context of classes in Python.

class test_c:
    b = 20
    def test_f(self):
        print b

test_c().test_f()

Let me explain the last statement first and then the rest of the example. test_c() creates an instance of the class test_c, and test_c().test_f() calls the test_f method on that instance. Naively, you would expect the code to print 20, which is the value of the variable b in the class. Instead, you will get a NameError telling you that the variable b isn't defined. The solution is to refer to b as self.b inside any of the methods defined on test_c, which tells Python that this variable belongs to the class on which the method is defined.
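For completeness, here is a corrected version of the class example, with print swapped for return so the value is easier to check; self.b makes Python look the attribute up on the instance first, falling back to the class:

```python
class test_c:
    b = 20

    def test_f(self):
        # self.b resolves on the instance, then falls back to the class
        return self.b

print(test_c().test_f())  # 20
```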

There are definitely many more ways in which you can make Python throw a NameError at you, but I wanted to use the NameError to introduce the concept of namespaces in Python. That's all for now. As always, I am thankful for any feedback on the writing style and/or content. Until next time ...

[1]. hilite.me was used to create the inline code blocks.
[2]. You can refer to the Python official documentation on namespaces for more information.
[3]. A Python Shell can be accessed on the official Python page. A more comprehensive editor can be found here.