Pump it Up: Data-Mining the Water Table

Link to the Databricks notebook: Click here

The goal is to predict the operating condition of a waterpoint for each record in the dataset. The following information is provided about each waterpoint:

amount_tsh - Total static head (amount of water available to the waterpoint)
date_recorded - The date the row was entered
funder - Who funded the well
gps_height - Altitude of the well
installer - Organization that installed the well
longitude - GPS coordinate
latitude - GPS coordinate
wpt_name - Name of the waterpoint if there is one
num_private -
basin - Geographic water basin
subvillage - Geographic location
region - Geographic location
region_code - Geographic location (coded)
district_code - Geographic location (coded)
lga - Geographic location
ward - Geographic location
population - Population around the well
public_meeting - True/False
recorded_by - Group entering this row of data
scheme_management - Who operates the waterpoint
scheme_name - Who operates the waterpoint
permit - If the waterpoint is permitted
construction_year - Year the waterpoint was constructed
extraction_type - The kind of extraction the waterpoint uses
extraction_type_group - The kind of extraction the waterpoint uses
extraction_type_class - The kind of extraction the waterpoint uses
management - How the waterpoint is managed
management_group - How the waterpoint is managed
payment - What the water costs
payment_type - What the water costs
water_quality - The quality of the water
quality_group - The quality of the water
quantity - The quantity of water
quantity_group - The quantity of water
source - The source of the water
source_type - The source of the water
source_class - The source of the water
waterpoint_type - The kind of waterpoint
waterpoint_type_group - The kind of waterpoint

The labels in this dataset

Distribution of Labels
The labels in this dataset are simple. There are three possible values:

functional - the waterpoint is operational and there are no repairs needed
functional needs repair - the waterpoint is operational, but needs repairs
non functional - the waterpoint is not operational

Coding Language/Environment used:

We used the Databricks platform for this task. The code is written in Scala, and the model is built with Spark’s MLlib library. We ran it on a Community Optimized Spark 2.1 cluster with 6 GB of memory.

First, we merged the training set values with the corresponding training set labels into a single CSV file, joining on the ID column. Keeping the label alongside the features makes it easier to build the model. We then mapped the class labels non functional, functional, and functional needs repair to 0, 1, and 2, respectively.
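The merge and label mapping described above can be sketched in plain Scala. This is only an illustration, not the notebook's code: the `LabelJoin` object, the `joinById` helper, and the in-memory `Map` representation of rows are assumptions made for the example. Only the three label names and their indices come from the text.

```scala
object LabelJoin {
  // Class indices as described in the text above.
  val labelIndex: Map[String, Int] = Map(
    "non functional" -> 0,
    "functional" -> 1,
    "functional needs repair" -> 2
  )

  // Inner join of feature rows and label rows on the shared id
  // (hypothetical helper). Rows with no label, or with an unknown
  // label name, are dropped.
  def joinById(
      features: Map[Int, Map[String, String]],
      labels: Map[Int, String]
  ): Map[Int, (Map[String, String], Int)] =
    features.toSeq.flatMap { case (id, row) =>
      for {
        name <- labels.get(id)
        idx  <- labelIndex.get(name)
      } yield id -> (row, idx)
    }.toMap
}
```

For example, joining features for ids 1 and 2 with labels for ids 1 and 3 keeps only id 1, since it is the only id present on both sides.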

We used 80% of the original data as training data and evaluated the model on the remaining 20%, using mean squared error as the metric. Because this is a multi-class classification problem (each waterpoint has exactly one of three labels), we experimented with several classification techniques and found logistic regression to be appropriate. We built the model with MLlib’s logistic regression, using its assembler, normalizer, and pipeline components, and used it to predict labels for the test set values CSV file.
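A minimal sketch of such a pipeline with Spark ML is shown here; it is not the notebook's actual code. It assumes a DataFrame whose numeric columns have no nulls and which has a `label` column of type double. The `PumpPipeline` object, the choice of input columns, the L2 norm, the intermediate column names, and the seed are all assumptions made for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Normalizer, VectorAssembler}
import org.apache.spark.sql.DataFrame

object PumpPipeline {
  // Hypothetical feature selection: a few numeric columns from the field list.
  val numericCols: Array[String] = Array(
    "amount_tsh", "gps_height", "longitude", "latitude",
    "population", "construction_year"
  )

  def buildPipeline(): Pipeline = {
    // Collect the numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(numericCols)
      .setOutputCol("rawFeatures")

    // Scale each row vector to unit L2 norm (assumed choice of p).
    val normalizer = new Normalizer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setP(2.0)

    // Multinomial logistic regression over the three class indices.
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setFamily("multinomial")

    new Pipeline().setStages(Array[PipelineStage](assembler, normalizer, lr))
  }

  // 80/20 split, fit on the training part, predict on the held-out part.
  def trainAndPredict(data: DataFrame): DataFrame = {
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    buildPipeline().fit(train).transform(test)
  }
}
```

The returned DataFrame carries a `prediction` column next to `label`, which is the pair of columns an evaluation step would consume.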

Results

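We split the data 80/20 and computed the mean squared error on the held-out 20% with MLlib’s RegressionMetrics. The value was 0.257, which we consider decent. Note that this error is computed over the numeric class indices 0 to 2, so it is not a misclassification rate.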
We also classified the test data. A screenshot of a few results is provided below.
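To make the mean squared error reported above concrete, the following plain-Scala sketch shows how it behaves on class indices. The `ClassMetrics` object and both helper names are hypothetical, and the accuracy function is an addition for comparison only; it is not something the original evaluation reports. Because the indices are treated as numbers, predicting 0 when the label is 2 costs 4, while predicting 1 when the label is 0 costs only 1.

```scala
object ClassMetrics {
  // Mean squared error over (prediction, label) pairs of class indices.
  def meanSquaredError(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    pairs.map { case (p, y) => (p - y) * (p - y) }.sum / pairs.size
  }

  // Fraction of predictions that equal the label exactly.
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (prediction, label) pair")
    pairs.count { case (p, y) => p == y }.toDouble / pairs.size
  }
}

object ClassMetricsDemo {
  // Made-up (prediction, label) pairs, for illustration only.
  val pairs: Seq[(Double, Double)] =
    Seq((1.0, 1.0), (0.0, 2.0), (2.0, 2.0), (1.0, 0.0))

  val mse: Double = ClassMetrics.meanSquaredError(pairs) // (0 + 4 + 0 + 1) / 4 = 1.25
  val acc: Double = ClassMetrics.accuracy(pairs)         // 2 of 4 correct = 0.5
}
```

Two sets of predictions with the same accuracy can therefore have different mean squared errors, depending on which classes are confused.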
