Pump it Up: Data Mining the Water Table

The goal is to predict the operating condition of a waterpoint for each record in the dataset. A set of information about each waterpoint is provided.

Distribution of Labels
The labels in this dataset are simple. There are three possible values: functional, functional needs repair, and non-functional.

Coding Language/Environment used:

We used the Databricks platform for this task. The programming language is Scala, and we used Spark’s MLlib library to build the model. We ran it on a Community Optimized Spark 2.1 cluster with 6 GB of memory.

First, we merged the training set values with the corresponding training set labels into a single CSV file by joining on the ID column. Having the label alongside the features makes it easier to build the model. We then mapped the class labels non-functional, functional, and functional needs repair to 0, 1, and 2 respectively.
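
The merge and label encoding described above can be sketched in plain Scala, without Spark. This is only an illustration: the object name `MergeLabels`, the in-memory `Map` representation, and the exact label strings are assumptions, and the label spellings in the actual CSV files may differ from the ones written in this report.

```scala
// A minimal sketch of "join on ID, then encode the label".
// MergeLabels and its in-memory Maps are hypothetical stand-ins for the CSV files.
object MergeLabels {
  // Encoding stated in the text: non-functional -> 0, functional -> 1,
  // functional needs repair -> 2. The string spellings are an assumption.
  val labelIndex: Map[String, Int] = Map(
    "non-functional" -> 0,
    "functional" -> 1,
    "functional needs repair" -> 2
  )

  // Inner join on ID: a row is kept only if it has both features and a known label.
  def merge(
      values: Map[Int, Seq[Double]],
      labels: Map[Int, String]
  ): Map[Int, (Seq[Double], Int)] =
    values.toSeq.flatMap { case (id, features) =>
      labels.get(id).flatMap(labelIndex.get).map(index => (id, (features, index)))
    }.toMap
}
```

For example, merging feature rows with IDs 1, 2, 3 against labels for IDs 1, 2, 4 keeps only IDs 1 and 2, each paired with its numeric class index.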

We used 80% of the original data as training data to build the model and evaluated it on the remaining 20%, using the Mean Squared Error metric. As this is a multi-class classification problem (each waterpoint has exactly one of three labels), we experimented with several classification techniques and found logistic regression to be appropriate. We built the model with MLlib’s logistic regression, using its assembler, normalizer, and pipeline components, and then used it to obtain labels for the Test set values CSV file.
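
A pipeline of this shape might look roughly like the following sketch against the Spark ML API. It is not the code we ran: the object name `WaterpointPipeline`, the column names `rawFeatures`, `features`, and `label`, the choice of feature columns, the L2 norm, the iteration count, and the random seed are all assumptions made for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{Normalizer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper; names and parameter values below are illustrative assumptions.
object WaterpointPipeline {
  // featureCols: numeric columns to feed the model (which ones were used is not stated here).
  def build(featureCols: Array[String]): Pipeline = {
    // Collect the chosen numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("rawFeatures")

    // Scale each feature vector to unit L2 norm (p = 2.0 is an assumed choice).
    val normalizer = new Normalizer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setP(2.0)

    // Multinomial logistic regression over the three classes encoded as 0, 1, 2.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label") // assumed to be a numeric column
      .setFamily("multinomial")
      .setMaxIter(100)

    new Pipeline().setStages(Array[PipelineStage](assembler, normalizer, lr))
  }

  // Fit on a random 80% of the rows and return predictions for the held-out 20%.
  def trainAndPredict(data: DataFrame, featureCols: Array[String]): DataFrame = {
    val Array(train, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    build(featureCols).fit(train).transform(holdout)
  }
}
```

Putting the three stages in one pipeline means the same assembling and normalizing steps are applied to the held-out rows, and later to the test set values, as were applied during training.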

Results

Using the 80/20 split, we calculated the Mean Squared Error with MLlib’s RegressionMetrics and obtained a value of 0.257, which we consider decent.
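
To make the metric concrete, here is a plain-Scala sketch of the quantity that a mean squared error over (prediction, label) pairs represents. It does not use MLlib, and the object name `MeanSquaredError` is hypothetical. Note that when the inputs are class indices, the value depends on the numeric encoding: confusing class 1 with class 2 contributes 1 to the sum, while confusing class 0 with class 2 contributes 4.

```scala
// A minimal sketch of mean squared error; MeanSquaredError is a hypothetical name.
object MeanSquaredError {
  // Each pair is (predicted class index, true class index).
  def mse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one prediction")
    val squaredErrors = pairs.map { case (predicted, actual) =>
      val diff = predicted - actual
      diff * diff
    }
    squaredErrors.sum / pairs.size
  }
}
```

For instance, `MeanSquaredError.mse(Seq((1.0, 1.0), (0.0, 1.0), (2.0, 0.0), (1.0, 1.0)))` sums the squared errors 0, 1, 4, and 0 and divides by 4, giving 1.25.
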
We have classified our test data. A screenshot of a few results is provided below.