Pump it Up: Data Mining the Water Table
Link to the Databricks notebook: Click here
The goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided with the following information about the waterpoints:
- amount_tsh - Total static head (amount of water available to the waterpoint)
- date_recorded - The date the row was entered
- funder - Who funded the well
- gps_height - Altitude of the well
- installer - Organization that installed the well
- longitude - GPS coordinate
- latitude - GPS coordinate
- wpt_name - Name of the waterpoint if there is one
- num_private -
- basin - Geographic water basin
- subvillage - Geographic location
- region - Geographic location
- region_code - Geographic location (coded)
- district_code - Geographic location (coded)
- lga - Geographic location
- ward - Geographic location
- population - Population around the well
- public_meeting - True/False
- recorded_by - Group entering this row of data
- scheme_management - Who operates the waterpoint
- scheme_name - Who operates the waterpoint
- permit - If the waterpoint is permitted
- construction_year - Year the waterpoint was constructed
- extraction_type - The kind of extraction the waterpoint uses
- extraction_type_group - The kind of extraction the waterpoint uses
- extraction_type_class - The kind of extraction the waterpoint uses
- management - How the waterpoint is managed
- management_group - How the waterpoint is managed
- payment - What the water costs
- payment_type - What the water costs
- water_quality - The quality of the water
- quality_group - The quality of the water
- quantity - The quantity of water
- quantity_group - The quantity of water
- source - The source of the water
- source_type - The source of the water
- source_class - The source of the water
- waterpoint_type - The kind of waterpoint
- waterpoint_type_group - The kind of waterpoint
The labels in this dataset
Distribution of Labels
The labels in this dataset are simple. There are three possible values:
- functional - the waterpoint is operational and there are no repairs needed
- functional needs repair - the waterpoint is operational, but needs repairs
- non functional - the waterpoint is not operational
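As a minimal sketch of how the label distribution could be computed, the helper below counts the share of records per label. The object and function names are hypothetical, and the sample labels are made up for illustration; they are not taken from the dataset.

```scala
object LabelDistribution {
  // Fraction of records carrying each label, keyed by label name.
  def distribution(labels: Seq[String]): Map[String, Double] = {
    val total = labels.size.toDouble
    labels.groupBy(identity).map { case (label, group) => (label, group.size / total) }
  }
}

// Hypothetical sample, for illustration only.
val sample = Seq("functional", "functional", "non functional", "functional needs repair")
val shares = LabelDistribution.distribution(sample)
// shares("functional") is 0.5; the other two labels are 0.25 each.
```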
Coding Language/Environment used:
We used the Databricks platform for this task. The programming language is Scala, and we used Spark’s MLlib library to build the model. We ran it on a Community Optimized Spark 2.1 cluster with 6 GB of memory.
First, we merged the Training set values with the corresponding Training set labels into a single CSV file by matching rows on the ID column. Having the label in the same place as the features makes it easier to build the model. Then, we mapped the class labels non functional, functional, and functional needs repair to 0, 1, and 2 respectively.
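The merge and label encoding described above can be sketched in plain Scala as follows. This is an illustration under assumptions, not the notebook's code: the in-memory `Map` representation, the `MergeLabels` object, and the choice to drop rows without a known label are all hypothetical. Only the 0/1/2 encoding comes from the text.

```scala
object MergeLabels {
  // Encoding stated in the text: non functional -> 0, functional -> 1,
  // functional needs repair -> 2.
  val labelIndex: Map[String, Int] = Map(
    "non functional" -> 0,
    "functional" -> 1,
    "functional needs repair" -> 2
  )

  // Attach the numeric label to each feature row by matching on the ID.
  // Assumption: rows with a missing or unrecognized label are dropped.
  def merge(values: Map[Int, Seq[String]],
            labels: Map[Int, String]): Map[Int, (Seq[String], Int)] =
    values.collect {
      case (id, features) if labels.get(id).exists(labelIndex.contains) =>
        (id, (features, labelIndex(labels(id))))
    }
}
```

In the Spark setting the same idea would typically be a DataFrame join on the ID column followed by a column that holds the encoded label.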
We used 80% of the original data as training data to build our model and evaluated the model on the remaining 20%, using the Mean Squared Error metric. As this is a multi-class classification problem (each waterpoint gets exactly one of three labels), we experimented with several classification techniques and found logistic regression to be appropriate. We built the model with MLlib’s logistic regression, using an assembler, a normalizer, and a pipeline, and then used it to obtain labels for the Test set values CSV file.
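A minimal sketch of such a pipeline with the DataFrame-based `spark.ml` API is shown below. Several details are assumptions rather than facts from the notebook: the choice of numeric feature columns, the `rawFeatures` intermediate column, the L2 norm, the seed, the iteration count, and the presence of a numeric `label` column holding the 0/1/2 encoding. The `WaterpointPipeline` object is hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Normalizer, VectorAssembler}
import org.apache.spark.sql.DataFrame

object WaterpointPipeline {
  // Assumption: a handful of numeric columns; the text does not say
  // which columns were actually fed to the model.
  val numericFeatures: Array[String] =
    Array("amount_tsh", "gps_height", "longitude", "latitude", "population")

  // Returns the fitted pipeline and its predictions on the 20% holdout.
  def train(data: DataFrame): (PipelineModel, DataFrame) = {
    // 80/20 split; the seed is arbitrary and only makes the split repeatable.
    val Array(training, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Collect the numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(numericFeatures)
      .setOutputCol("rawFeatures")

    // Scale each feature vector to unit L2 norm.
    val normalizer = new Normalizer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setP(2.0)

    // Multinomial logistic regression over the three classes.
    val lr = new LogisticRegression()
      .setFamily("multinomial")
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(100)

    val model = new Pipeline()
      .setStages(Array(assembler, normalizer, lr))
      .fit(training)

    (model, model.transform(holdout))
  }
}
```

The returned holdout DataFrame carries a `prediction` column next to `label`, which is what an evaluation step would consume. Categorical columns such as `basin` or `quantity` would need indexing or encoding before they could join the assembled vector.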
Results
On the 20% held-out split, we calculated the Mean Squared Error between the predicted and true label indices with MLlib’s RegressionMetrics and obtained 0.257, which we consider decent.
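To make the metric concrete, here is a small plain-Scala sketch of what the MSE measures on encoded class labels, with plain accuracy alongside for comparison. The `HoldoutMetrics` object and the sample pairs are hypothetical; the reported 0.257 came from RegressionMetrics, not from this code.

```scala
object HoldoutMetrics {
  // Mean squared error between predicted and true class indices.
  // Assumes a non-empty sequence of (predicted, actual) pairs.
  def meanSquaredError(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (predicted, actual) =>
      val diff = predicted - actual
      diff * diff
    }.sum / pairs.size

  // Share of records whose predicted class equals the true class.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (predicted, actual) => predicted == actual }.toDouble / pairs.size
}

// Hypothetical (predicted, actual) pairs: three correct, one 2-vs-0 confusion.
val pairs = Seq((1.0, 1.0), (0.0, 0.0), (2.0, 0.0), (1.0, 1.0))
val mse = HoldoutMetrics.meanSquaredError(pairs) // 1.0
val acc = HoldoutMetrics.accuracy(pairs)         // 0.75
```

Because the metric works on the 0/1/2 encoding, confusing classes 0 and 2 contributes four times as much error as confusing classes 0 and 1, so the MSE depends on how the labels are numbered. Accuracy does not have that dependence.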
We also classified the test data with the trained model. A screenshot of a few results is provided below.