Link to Jupyter notebook;

On the Northeast Corridor, the busiest passenger rail line in the United States, Amtrak also operates passenger rail service; together, NJ Transit and Amtrak operate nearly 750 trains across the NJ Transit rail network. The data is organized into monthly performance reports covering nearly every train trip on the network, with the following features: date, train ID, stop sequence, the station the train is traveling from and its ID, the station the train is traveling to and its ID, scheduled time, actual time, and delay in minutes. We will clean and analyze these features in preparation for machine learning.
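As a rough illustration of the cleaning step, the sketch below derives the delay in minutes from the scheduled and actual times and attaches a binary "late" label. The field names (`scheduled_time`, `actual_time`), the timestamp format, the 5-minute threshold, and the two sample rows are all assumptions made up for this example; they are not taken from the dataset itself.

```python
from datetime import datetime

# Assumption: a stop counts as "late" when it is more than 5 minutes behind
# schedule. The threshold is a hypothetical choice, not defined by the source.
LATE_THRESHOLD_MINUTES = 5.0

# Assumption: timestamps look like "2018-03-01 08:00:00".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def delay_minutes(scheduled: str, actual: str) -> float:
    """Return the difference between actual and scheduled time, in minutes."""
    delta = datetime.strptime(actual, TIME_FORMAT) - datetime.strptime(scheduled, TIME_FORMAT)
    return delta.total_seconds() / 60.0


def label_stop(row: dict) -> dict:
    """Return a copy of the row with a computed delay and a boolean late flag."""
    delay = delay_minutes(row["scheduled_time"], row["actual_time"])
    return {**row, "delay_minutes": delay, "is_late": delay > LATE_THRESHOLD_MINUTES}


# Made-up rows for illustration only.
sample_rows = [
    {"train_id": "A", "scheduled_time": "2018-03-01 08:00:00", "actual_time": "2018-03-01 08:07:30"},
    {"train_id": "B", "scheduled_time": "2018-03-01 08:15:00", "actual_time": "2018-03-01 08:16:00"},
]

labeled = [label_stop(row) for row in sample_rows]
for row in labeled:
    print(row["train_id"], row["delay_minutes"], row["is_late"])
```

The `is_late` flag is the kind of target a classifier could be trained on; the right threshold depends on how a delay is defined for riders.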

The proposed business solution is a classification model that predicts whether a train will be late based on the features given. Clustering can also be used to identify rail lines most in need of infrastructure investment and upgrades, which would help reduce congestion and commute times. The purpose is to inform riders in advance of potential delays and cancellations using data-driven machine learning algorithms. This would decrease congestion at rail stations, where crowds from delayed trains often build up, and would minimize customer complaints and refunds by giving timelier notice of potential issues.

I will be using a range of models and learning methods to implement the business solution above: supervised learning (logistic regression, gradient boosting, KNN classifier, decision tree, random forest), supported by feature selection and dimensionality reduction (SelectKBest, PCA) and GridSearchCV hyperparameter tuning; unsupervised learning (t-SNE, PCA, KMeans and DBSCAN clustering); and deep learning models.
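To show how a few of these pieces can fit together, here is a minimal scikit-learn sketch that chains scaling, SelectKBest feature selection, and logistic regression in a pipeline and tunes it with GridSearchCV. It runs on synthetic data from `make_classification` as a stand-in for the train data, and the parameter grid values are arbitrary assumptions, so it illustrates the workflow rather than the notebook's actual configuration or results.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 numeric features, binary target
# playing the role of "late" vs "on time".
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features, keep the k highest-scoring ones, then fit a classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: number of features to keep and regularization strength.
param_grid = {
    "select__k": [3, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

# 5-fold cross-validated search over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

Putting feature selection inside the pipeline means it is refit on each cross-validation fold, which avoids leaking information from the validation folds into the selected features. Swapping the final step for another classifier from the list above only requires changing the `clf` entry and its grid keys.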

Data originates from here;
