Merge pull request #70 from CartoDB/segmentation

Segmentation pull request
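
For reviewers, a minimal sketch of the packing convention that `CDB_PyAgg` and `unpack2D` rely on, written in plain Python rather than PL/Python and NumPy. The helper names `pack_rows` and `unpack_2d` are hypothetical and are not part of this PR; they only illustrate the `[n_columns, row 1 values, row 2 values, ...]` layout.

```python
def pack_rows(rows):
    """Flatten equal-length rows into [n_columns, v11, v12, ..., vnm]."""
    if not rows:
        return []
    n_columns = len(rows[0])
    if any(len(row) != n_columns for row in rows):
        raise ValueError("all rows must have the same number of columns")
    packed = [n_columns]  # first entry records the row width
    for row in rows:
        packed.extend(row)
    return packed


def unpack_2d(packed):
    """Rebuild the list of rows from a packed 1D list."""
    if not packed:
        return []
    n_columns = int(packed[0])
    values = packed[1:]
    if n_columns <= 0 or len(values) % n_columns != 0:
        raise ValueError("packed array length does not match its column count")
    # slice the flat values back into rows of n_columns entries each
    return [values[i:i + n_columns] for i in range(0, len(values), n_columns)]


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
packed = pack_rows(rows)
restored = unpack_2d(packed)
```

The packed list has length rows * columns + 1, which matches the length described in doc/04_pyAgg.md.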
2016-06-29 18:35:01 +02:00
parent ab349fbdc0 d97231f604
commit 2fe50e1f71
14 changed files with 2502 additions and 0 deletions
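
The NaN handling in `replace_nan_with_mean` fills each missing entry with the mean of the non-missing values in the same column. A rough plain-Python sketch of that idea follows. `replace_nan_with_column_mean` is a hypothetical name; it works on lists of rows rather than NumPy arrays, and it returns a copy instead of modifying the input in place as the PR's function does.

```python
import math


def replace_nan_with_column_mean(rows):
    """Return a copy of rows with each NaN replaced by its column mean."""
    n_columns = len(rows[0]) if rows else 0
    filled = [list(row) for row in rows]  # copy so the input is untouched
    for col in range(n_columns):
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        if not observed:
            continue  # column is entirely NaN; nothing to average, leave as-is
        mean = sum(observed) / len(observed)
        for row in filled:
            if math.isnan(row[col]):
                row[col] = mean
    return filled


data = [[1.0, float("nan")], [3.0, 4.0], [float("nan"), 8.0]]
filled = replace_nan_with_column_mean(data)
```

Note that this sketch, like the function in the PR, assumes 2D input with one row per sample.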
--- a/doc/04_pyAgg.md
+++ b/doc/04_pyAgg.md
@@ -0,0 +1,23 @@
+## PyAgg Helper Function 
+
+### CDB_pyAgg (columns Numeric[])
+
+Currently it's not possible to pass a multidimensional array between plpgsql and plpythonu. This function aims to
+help fix that by aggregating the columns provided in the argument across rows into a 1D array of length rows * columns + 1. The first element of the array is the array\_length of the columns argument so that Python can reconstruct
+the 2D array.
+
+#### Arguments
+
+| Name | Type | Description |
+|------|------|-------------|
+| columns | NUMERIC[] | The columns to aggregate across rows|
+
+#### Returns
+
+A table with the following columns.
+
+| Column Name | Type | Description |
+|-------------|------|-------------|
+| result | NUMERIC[] | A 1D array of length columns * rows + 1 whose first entry is the number of columns |
+
+
--- a/doc/12_segmentation.md
+++ b/doc/12_segmentation.md
@@ -0,0 +1,83 @@
+
+## Segmentation Functions
+
+### CDB_CreateAndPredictSegment(query TEXT, variable_name TEXT, target_query TEXT)
+
+This function trains a [Gradient Boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) model to attempt to predict the target data and then generates predictions for new data.  
+
+#### Arguments
+
+| Name | Type | Description |
+|------|------|-------------|
+| query | TEXT | The input query to train the algorithm, which should have both the variable of interest and the features that will be used to predict it |
+| variable\_name| TEXT | Specify the variable in the query to predict, all other columns are assumed to be features |
+| target\_table | TEXT | The query which returns the `cartodb_id` and features for the rows you would like to predict the target variable for |
+| n\_estimators (optional) | INTEGER DEFAULT 1200 | Number of estimators to be used. Values should be between 1 and x. |
+| max\_depth (optional) | INTEGER DEFAULT 3 | Max tree depth. Values should be between 1 and n. |
+| subsample (optional)  | DOUBLE PRECISION DEFAULT 0.5 | Subsample parameter for GradientBooster. Values should be within the range 0 to 1. |
+| learning\_rate (optional) | DOUBLE PRECISION DEFAULT 0.01 | Learning rate for the GradientBooster. Values should be between 0 and 1 (??) |
+| min\_samples\_leaf (optional) | INTEGER DEFAULT 1 | Minimum samples to use per leaf. Values should range from x to y |
+
+#### Returns
+
+A table with the following columns.
+
+| Column Name | Type | Description |
+|-------------|------|-------------|
+| cartodb\_id | TEXT | The CartoDB id of the row in the target\_query |
+| prediction | NUMERIC | The predicted value of the variable of interest |
+| accuracy | NUMERIC | The mean squared error of the model's predictions (lower is better) |
+
+#### Example Usage
+
+```sql
+SELECT * from cdb_crankshaft.CDB_CreateAndPredictSegment(
+'SELECT agg, median_rent::numeric, male_pop::numeric, female_pop::numeric FROM late_night_agg',
+'agg',
+'SELECT row_number() OVER () As cartodb_id, median_rent, male_pop, female_pop FROM ml_learning_ny');                               
+```
+
+### CDB_CreateAndPredictSegment(target numeric[], train_features numeric[], prediction_features numeric[], prediction_ids numeric[])
+
+This function trains a [Gradient Boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) model to attempt to predict the target data and then generates predictions for new data.  
+
+
+#### Arguments
+
+| Name | Type | Description |
+|------|------|-------------|
+| target | numeric[] | An array of target values of the variable you want to predict|
+| train\_features| numeric[] | 1D array of length n\_features \* n\_rows + 1 with the first entry in the array being the number of features in each row. These are the features the model will be trained on. CDB\_Crankshaft.CDB\_pyAgg(Array[feature1, feature2, feature3]::numeric[]) can be used to construct this. |
+| prediction\_features | numeric[] | 1D array of length n\_features \* n\_rows + 1 with the first entry in the array being the number of features in each row. These are the features that will be used to predict the target variable. CDB\_Crankshaft.CDB\_pyAgg(Array[feature1, feature2, feature3]::numeric[]) can be used to construct this. |
+| prediction\_ids | numeric[] | 1D array of length n\_rows with the ids that can be used to re-join the data with inputs |
+| n\_estimators (optional) | INTEGER DEFAULT 1200 | Number of estimators to be used |
+| max\_depth (optional) | INTEGER DEFAULT 3 | Max tree depth |
+| subsample (optional)  | DOUBLE PRECISION DEFAULT 0.5 | Subsample parameter for GradientBooster|
+| learning\_rate (optional) | DOUBLE PRECISION DEFAULT 0.01 | Learning rate for the GradientBooster |
+| min\_samples\_leaf (optional) | INTEGER DEFAULT 1 | Minimum samples to use per leaf |
+
+
+#### Returns
+
+A table with the following columns.
+
+| Column Name | Type | Description |
+|-------------|------|-------------|
+| cartodb\_id | NUMERIC | The id, taken from prediction\_ids, of the row the prediction belongs to |
+| prediction | NUMERIC | The predicted value of the variable of interest |
+| accuracy | NUMERIC | The mean squared error of the model's predictions (lower is better) |
+
+#### Example Usage
+
+```sql
+WITH training As (
+    SELECT array_agg(agg) As target,
+           cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features
+    FROM late_night_agg),
+target AS (
+    SELECT cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features,
+     array_agg(cartodb_id) As cartodb_ids FROM late_night_agg)  
+
+SELECT cdb_crankshaft.CDB_CreateAndPredictSegment(training.target, training.features, target.features, target.cartodb_ids)
+FROM training, target;
+```
--- a/release/python/0.0.1/crankshaft/crankshaft/__init__.py
+++ b/release/python/0.0.1/crankshaft/crankshaft/__init__.py
@@ -1,2 +1,3 @@
 import random_seeds
 import clustering
+import segmentation
--- a/src/pg/sql/04_py_agg.sql
+++ b/src/pg/sql/04_py_agg.sql
@@ -0,0 +1,19 @@
+CREATE OR REPLACE FUNCTION
+    CDB_PyAggS(current_state Numeric[], current_row Numeric[]) 
+    returns NUMERIC[] as $$
+    BEGIN
+        if array_upper(current_state,1) is null  then
+            RAISE NOTICE 'setting state %',array_upper(current_row,1);
+            current_state[1] = array_upper(current_row,1);
+        end if;
+        return array_cat(current_state,current_row) ;
+    END
+    $$ LANGUAGE plpgsql;
+
+
+CREATE AGGREGATE CDB_PyAgg(NUMERIC[])(
+    SFUNC = CDB_PyAggS,
+    STYPE = Numeric[],
+    INITCOND = "{}" 
+);
+
--- a/src/pg/sql/05_segmentation.sql
+++ b/src/pg/sql/05_segmentation.sql
@@ -0,0 +1,53 @@
+
+CREATE OR REPLACE FUNCTION
+  CDB_CreateAndPredictSegment(
+    target NUMERIC[],
+    features NUMERIC[],
+    target_features NUMERIC[],
+    target_ids NUMERIC[],
+    n_estimators INTEGER DEFAULT 1200,
+    max_depth INTEGER DEFAULT 3,
+    subsample DOUBLE PRECISION DEFAULT 0.5,
+    learning_rate DOUBLE PRECISION DEFAULT 0.01,
+    min_samples_leaf INTEGER DEFAULT 1)
+RETURNS TABLE(cartodb_id NUMERIC, prediction NUMERIC, accuracy NUMERIC)
+AS $$
+    import numpy as np
+    import plpy
+
+    from crankshaft.segmentation import create_and_predict_segment_agg
+    model_params = {'n_estimators': n_estimators,
+                    'max_depth': max_depth,
+                    'subsample': subsample,
+                    'learning_rate': learning_rate,
+                    'min_samples_leaf': min_samples_leaf}
+
+    def unpack2D(data):
+        dimension = int(data.pop(0))
+        a = np.array(data, dtype=float)
+        return a.reshape(len(a) // dimension, dimension)
+
+    return create_and_predict_segment_agg(np.array(target, dtype=float),
+            unpack2D(features),
+            unpack2D(target_features),
+            target_ids,
+            model_params)
+
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION
+  CDB_CreateAndPredictSegment (
+      query TEXT,
+      variable_name TEXT,
+      target_table TEXT,
+      n_estimators INTEGER DEFAULT 1200,
+      max_depth INTEGER DEFAULT 3,
+      subsample DOUBLE PRECISION DEFAULT 0.5,
+      learning_rate DOUBLE PRECISION DEFAULT 0.01,
+      min_samples_leaf INTEGER DEFAULT 1)
+RETURNS TABLE (cartodb_id TEXT, prediction NUMERIC, accuracy NUMERIC)
+AS $$
+  from crankshaft.segmentation import create_and_predict_segment
+  model_params = {'n_estimators': n_estimators, 'max_depth':max_depth, 'subsample' : subsample, 'learning_rate': learning_rate, 'min_samples_leaf' : min_samples_leaf}
+  return create_and_predict_segment(query,variable_name,target_table, model_params)
+$$ LANGUAGE plpythonu;
--- a/src/pg/test/expected/06_segmentation_test.out
+++ b/src/pg/test/expected/06_segmentation_test.out
@@ -0,0 +1,27 @@
+\pset format unaligned
+\set ECHO none
+_cdb_random_seeds
+
+(1 row)
+within_tolerance
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+t
+(20 rows)
--- a/src/pg/test/fixtures/ml_values.sql
+++ b/src/pg/test/fixtures/ml_values.sql
--- a/src/pg/test/sql/06_segmentation_test.sql
+++ b/src/pg/test/sql/06_segmentation_test.sql
@@ -0,0 +1,33 @@
+\pset format unaligned
+\set ECHO none
+\i test/fixtures/ml_values.sql
+SELECT cdb_crankshaft._cdb_random_seeds(1234);
+
+WITH expected AS (
+  SELECT generate_series(1000,1020) AS id, unnest(ARRAY[
+    4.5656517130822492,
+    1.7928053473230694,
+    1.0283378773916563,
+    2.6586517814904593,
+    2.9699056242935944,
+    3.9550646059951347,
+    4.1662572444459745,
+    3.8126334839264162,
+    1.8809821053623488,
+    1.6349065129019873,
+    3.0391288591472954,
+    3.3035970359672553,
+    1.5835471589451968,
+    3.7530378537263638,
+    1.0833589653009252,
+    3.8104965452882897,
+    2.665217959294802,
+    1.5850334252802472,
+    3.679401198805563,
+    3.5332033186588636
+    ]) AS expected LIMIT 20
+), prediction AS (
+  SELECT cartodb_id::integer id, prediction
+  FROM cdb_crankshaft.CDB_CreateAndPredictSegment('SELECT target, x1, x2, x3 FROM ml_values WHERE class = $$train$$','target','SELECT cartodb_id, target, x1, x2, x3 FROM ml_values WHERE class = $$test$$')
+  LIMIT 20
+) SELECT abs(e.expected - p.prediction) <= 1e-9 AS within_tolerance FROM expected e, prediction p WHERE e.id = p.id;
--- a/src/py/crankshaft/crankshaft/__init__.py
+++ b/src/py/crankshaft/crankshaft/__init__.py
@@ -2,3 +2,4 @@
 import crankshaft.random_seeds
 import crankshaft.clustering
 import crankshaft.space_time_dynamics
+import crankshaft.segmentation
--- a/src/py/crankshaft/crankshaft/segmentation/__init__.py
+++ b/src/py/crankshaft/crankshaft/segmentation/__init__.py
@@ -0,0 +1 @@
+from segmentation import * 
--- a/src/py/crankshaft/crankshaft/segmentation/segmentation.py
+++ b/src/py/crankshaft/crankshaft/segmentation/segmentation.py
@@ -0,0 +1,176 @@
+"""
+Segmentation creation and prediction
+"""
+
+import sklearn
+import numpy as np
+import plpy
+from sklearn.ensemble import GradientBoostingRegressor
+from sklearn import metrics
+from sklearn.cross_validation import train_test_split
+
+# Lower level functions
+#----------------------
+
+def replace_nan_with_mean(array):
+    """
+        Input:
+            @param array: an array of floats which may have null-valued entries
+        Output:
+            array with nans filled in with the mean of the dataset
+    """
+    # returns an array of rows and column indices
+    indices = np.where(np.isnan(array))
+
+    # iterate through entries which have nan values
+    for row, col in zip(*indices):
+            array[row, col] = np.mean(array[~np.isnan(array[:, col]), col])
+
+    return array
+
+def get_data(variable, feature_columns, query):
+    """
+        Fetch data from the database, clean, and package into
+          numpy arrays
+        Input:
+            @param variable: name of the target variable
+            @param feature_columns: list of column names
+            @param query: subquery that data is pulled from for the packaging
+        Output:
+            prepared data, packaged into NumPy arrays
+    """
+
+    columns = ','.join(['array_agg("{col}") As "{col}"'.format(col=col) for col in feature_columns])
+
+    try:
+        data = plpy.execute('''SELECT array_agg("{variable}") As target, {columns} FROM ({query}) As a'''.format(
+            variable=variable,
+            columns=columns,
+            query=query))
+    except Exception, e:
+        plpy.error('Failed to access data to build segmentation model: %s' % e)
+
+    # extract target data from plpy object
+    target = np.array(data[0]['target'])
+
+    # put n feature data arrays into an n x m array of arrays
+    features = np.column_stack([np.array(data[0][col], dtype=float) for col in feature_columns])
+
+    return replace_nan_with_mean(target), replace_nan_with_mean(features)
+
+# High level interface
+# --------------------
+
+def create_and_predict_segment_agg(target, features, target_features, target_ids, model_parameters):
+    """
+    Version of create_and_predict_segment that works on arrays that come straight from the SQL calling
+    the function.
+
+        Input:
+            @param target: The 1D array of length NSamples containing the target variable we want the model to predict
+            @param features: The 2D array of size NSamples * NFeatures that forms the input to the model
+            @param target_ids: A 1D array of target_ids that will be used to associate the results of the prediction with the rows which they come from
+            @param model_parameters: A dictionary containing parameters for the model.
+    """
+
+    clean_target = replace_nan_with_mean(target)
+    clean_features = replace_nan_with_mean(features)
+    target_features = replace_nan_with_mean(target_features)
+
+    model, accuracy = train_model(clean_target, clean_features, model_parameters, 0.2)
+    prediction = model.predict(target_features)
+    accuracy_array = [accuracy]*prediction.shape[0]
+    return zip(target_ids, prediction, accuracy_array)
+
+
+
+def create_and_predict_segment(query, variable, target_query, model_params):
+    """
+    generate a segment with machine learning
+    Stuart Lynn
+    """
+
+    ## fetch column names
+    try:
+        columns = plpy.execute('SELECT * FROM ({query}) As a LIMIT 1  '.format(query=query))[0].keys()
+    except Exception, e:
+        plpy.error('Failed to build segmentation model: %s' % e)
+
+    ## extract column names to be used in building the segmentation model
+    feature_columns = set(columns) - set([variable, 'cartodb_id', 'the_geom', 'the_geom_webmercator'])
+    ## get data from database
+    target, features = get_data(variable, feature_columns, query)
+
+    model, accuracy = train_model(target, features, model_params, 0.2)
+    cartodb_ids, result = predict_segment(model, feature_columns, target_query)
+    accuracy_array = [accuracy]*result.shape[0]
+    return zip(cartodb_ids, result, accuracy_array)
+
+
+def train_model(target, features, model_params, test_split):
+    """
+        Train the Gradient Boosting model on the provided data and calculate the accuracy of the model
+        Input:
+            @param target: 1D Array of the variable that the model is to be trained to predict
+            @param features: 2D Array NSamples * NFeatures to use in training the model
+            @param model_params: A dictionary of model parameters, the full specification can be found on the
+                scikit learn page for [GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
+            @param test_split: The fraction of the data to be withheld for testing the model / calculating the accuracy
+    """
+    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=test_split)
+    model = GradientBoostingRegressor(**model_params)
+    model.fit(features_train, target_train)
+    accuracy = calculate_model_accuracy(model, features, target)
+    return model, accuracy
+
+def calculate_model_accuracy(model, features, target):
+    """
+        Calculate the mean squared error of the model prediction
+        Input:
+            @param model: model trained from input features
+            @param features: features to make a prediction from
+            @param target: target to compare prediction to
+        Output:
+            mean squared error of the model prediction compared to the target
+    """
+    prediction = model.predict(features)
+    return metrics.mean_squared_error(prediction, target)
+
+def predict_segment(model, features, target_query):
+    """
+    Use the provided model to predict the values for the new feature set
+        Input:
+            @param model: The pretrained model
+            @param features: A list of features to use in the model prediction (list of column names)
+            @param target_query: The query to run to obtain the data to predict on and the cartodb_ids associated with it.
+    """
+
+    batch_size = 1000
+    joined_features = ','.join(['"{0}"::numeric'.format(a) for a in features])
+
+    try:
+        cursor = plpy.cursor('SELECT Array[{joined_features}] As features FROM ({target_query}) As a'.format(
+            joined_features=joined_features,
+            target_query=target_query))
+    except Exception, e:
+        plpy.error('Failed to build segmentation model: %s' % e)
+
+    results = []
+
+    while True:
+        rows = cursor.fetch(batch_size)
+        if not rows:
+            break
+        batch = np.row_stack([np.array(row['features'], dtype=float) for row in rows])
+
+        #Need to fix this. Should be global mean. This will cause weird effects
+        batch = replace_nan_with_mean(batch)
+        prediction = model.predict(batch)
+        results.append(prediction)
+
+    try:
+        cartodb_ids = plpy.execute('''SELECT array_agg(cartodb_id ORDER BY cartodb_id) As cartodb_ids FROM ({0}) As a'''.format(target_query))[0]['cartodb_ids']
+    except Exception, e:
+        plpy.error('Failed to build segmentation model: %s' % e)
+
+    return cartodb_ids, np.concatenate(results)
--- a/src/py/crankshaft/setup.py
+++ b/src/py/crankshaft/setup.py
@@ -40,6 +40,7 @@ setup(

    # The choice of component versions is dictated by what's
    # provisioned in the production servers.
+    # IMPORTANT NOTE: please don't change this line. Instead issue a ticket to systems for evaluation.
    install_requires=['joblib==0.8.3', 'numpy==1.6.1', 'scipy==0.14.0', 'pysal==1.11.2', 'scikit-learn==0.14.1'],

    requires=['pysal', 'numpy', 'sklearn'],
--- a/src/py/crankshaft/test/mock_plpy.py
+++ b/src/py/crankshaft/test/mock_plpy.py
@@ -1,5 +1,16 @@
 import re

+class MockCursor:
+    def __init__(self, data):
+        self.cursor_pos = 0
+        self.data = data
+
+    def fetch(self, batch_size):
+        batch = self.data[self.cursor_pos : self.cursor_pos + batch_size]
+        self.cursor_pos += batch_size
+        return batch
+
+
 class MockPlPy:
    def __init__(self):
        self._reset()
@@ -30,6 +41,10 @@ class MockPlPy:
    def info(self, msg):
        self.infos.append(msg)

+    def cursor(self, query):
+        data = self.execute(query)
+        return MockCursor(data)
+
    def execute(self, query): # TODO: additional arguments
       for result in self.results:
          if result[0].match(query):
--- a/src/py/crankshaft/test/test_segmentation.py
+++ b/src/py/crankshaft/test/test_segmentation.py
@@ -0,0 +1,64 @@
+import unittest
+import numpy as np
+from helper import plpy, fixture_file
+import crankshaft.segmentation as segmentation
+import json
+
+class SegmentationTest(unittest.TestCase):
+    """Testing class for segmentation functions"""
+
+    def setUp(self):
+        plpy._reset()
+
+    def generate_random_data(self,n_samples,random_state,  row_type=False):
+        x1 = random_state.uniform(size=n_samples)
+        x2 = random_state.uniform(size=n_samples)
+        x3 = random_state.randint(0, 4, size=n_samples)
+
+        y = x1+x2*x2+x3
+        cartodb_id  = range(len(x1))
+
+        if row_type:
+            return [ {'features': vals} for vals in zip(x1,x2,x3)], y
+        else:
+            return  [dict( zip(['x1','x2','x3','target', 'cartodb_id'],[x1,x2,x3,y,cartodb_id]))]
+
+    def test_replace_nan_with_mean(self):
+        test_array = np.array([1.2, np.nan, 3.2, np.nan, np.nan])
+
+    def test_create_and_predict_segment(self):
+        n_samples = 1000
+
+        random_state_train = np.random.RandomState(13)
+        random_state_test = np.random.RandomState(134)
+        training_data = self.generate_random_data(n_samples, random_state_train)
+        test_data, test_y = self.generate_random_data(n_samples, random_state_test, row_type=True)
+
+
+        ids =  [{'cartodb_ids': range(len(test_data))}]
+        rows =  [{'x1': 0,'x2':0,'x3':0,'y':0,'cartodb_id':0}]
+
+        plpy._define_result('select \* from  \(select \* from training\) a  limit 1',rows)
+        plpy._define_result('.*from \(select \* from training\) as a' ,training_data)
+        plpy._define_result('select array_agg\(cartodb\_id order by cartodb\_id\) as cartodb_ids from \(.*\) a',ids)
+        plpy._define_result('.*select \* from test.*' ,test_data)
+
+        model_parameters =  {'n_estimators': 1200,
+                             'max_depth': 3,
+                             'subsample' : 0.5,
+                             'learning_rate': 0.01,
+                             'min_samples_leaf': 1}
+
+        result = segmentation.create_and_predict_segment(
+                'select * from training',
+                'target',
+                'select * from test',
+                model_parameters)
+
+        prediction = [r[1] for r in result]
+
+        accuracy =np.sqrt(np.mean( np.square( np.array(prediction) - np.array(test_y))))
+
+        self.assertEqual(len(result),len(test_data))
+        self.assertTrue( result[0][2] < 0.01)
+        self.assertTrue( accuracy < 0.5*np.mean(test_y)  )