Merge pull request #70 from CartoDB/segmentation
Segmentation pull request
23
doc/04_pyAgg.md
Normal file
@@ -0,0 +1,23 @@
## PyAgg Helper Function

### CDB_PyAgg (columns Numeric[])

Currently it is not possible to pass a multidimensional array between plpgsql and plpythonu functions. This helper works around that by aggregating the columns provided in the argument across rows into a 1D array of length rows * columns + 1. The first element of the array is the array\_length of the columns argument, so that Python can reconstruct the 2D array.
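
On the Python side the flat array can be unpacked back into a 2D array. A minimal sketch — the function and variable names here are illustrative (this PR's SQL defines an equivalent `unpack2D` helper):

```python
import numpy as np

def unpack2d(data):
    # The first element encodes the number of columns per row.
    n_cols = int(data.pop(0))
    arr = np.array(data, dtype=float)
    return arr.reshape(len(arr) // n_cols, n_cols)

# e.g. two columns aggregated over three rows:
flat = [2, 1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
print(unpack2d(flat))
# [[  1.  10.]
#  [  2.  20.]
#  [  3.  30.]]
```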

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| columns | NUMERIC[] | The columns to aggregate across rows |

#### Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| result | NUMERIC[] | A columns * rows + 1 length array where the first entry is the number of columns |

83
doc/12_segmentation.md
Normal file
@@ -0,0 +1,83 @@
## Segmentation Functions

### CDB_CreateAndPredictSegment(query TEXT, variable_name TEXT, target_table TEXT)

This function trains a [Gradient Boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) model to predict the variable of interest and then generates predictions for new data.

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| query | TEXT | The input query used to train the model; it should return both the variable of interest and the features that will be used to predict it |
| variable\_name | TEXT | The name of the column in the query to predict; all other columns are treated as features |
| target\_table | TEXT | A query returning the `cartodb_id` and features for the rows you would like to predict the target variable for |
| n\_estimators (optional) | INTEGER DEFAULT 1200 | Number of boosting stages to use; must be a positive integer |
| max\_depth (optional) | INTEGER DEFAULT 3 | Maximum tree depth; must be a positive integer |
| subsample (optional) | DOUBLE PRECISION DEFAULT 0.5 | Fraction of samples used to fit each tree; must be in the range (0, 1] |
| learning\_rate (optional) | DOUBLE PRECISION DEFAULT 0.01 | Learning rate for the boosting; must be positive, and values well below 1 are typical |
| min\_samples\_leaf (optional) | INTEGER DEFAULT 1 | Minimum number of samples required at a leaf node; must be a positive integer |

#### Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| cartodb\_id | INTEGER | The `cartodb_id` of the row in the target\_table query |
| prediction | NUMERIC | The predicted value of the variable of interest |
| accuracy | NUMERIC | The mean squared error of the model on held-out test data |

#### Example Usage

```sql
SELECT * FROM cdb_crankshaft.CDB_CreateAndPredictSegment(
    'SELECT agg, median_rent::numeric, male_pop::numeric, female_pop::numeric FROM late_night_agg',
    'agg',
    'SELECT row_number() OVER () As cartodb_id, median_rent, male_pop, female_pop FROM ml_learning_ny');
```
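
Internally the function fits a scikit-learn `GradientBoostingRegressor` with these parameters, holds out 20% of the training rows to estimate the model's mean squared error, and predicts over the target rows (see `src/py/crankshaft/crankshaft/segmentation/segmentation.py` in this PR). A minimal standalone sketch of the same flow, with made-up data and the scikit-learn version pinned by this PR:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.cross_validation import train_test_split
from sklearn import metrics

# Hypothetical stand-ins for the data returned by the two queries above.
features = np.random.rand(100, 3)          # e.g. median_rent, male_pop, female_pop
target = features[:, 0] + features[:, 1]   # e.g. agg
new_features = np.random.rand(10, 3)       # rows to predict on

# Hold out 20% of the rows to estimate accuracy, as the PL/Python code does.
f_train, f_test, t_train, t_test = train_test_split(features, target, test_size=0.2)
model = GradientBoostingRegressor(n_estimators=1200, max_depth=3, subsample=0.5,
                                  learning_rate=0.01, min_samples_leaf=1)
model.fit(f_train, t_train)
accuracy = metrics.mean_squared_error(t_test, model.predict(f_test))
predictions = model.predict(new_features)
```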

### CDB_CreateAndPredictSegment(target numeric[], train_features numeric[], prediction_features numeric[], prediction_ids numeric[])

This variant trains the same [Gradient Boosting](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) model, but takes its training data and prediction data as pre-packed arrays instead of queries.

#### Arguments

| Name | Type | Description |
|------|------|-------------|
| target | numeric[] | An array of target values of the variable you want to predict |
| train\_features | numeric[] | A 1D array of length n\_features \* n\_rows + 1 whose first entry is the number of features in each row; these are the features the model will be trained on. `cdb_crankshaft.CDB_PyAgg(Array[feature1, feature2, feature3]::numeric[])` can be used to construct this |
| prediction\_features | numeric[] | A 1D array of length n\_features \* n\_rows + 1 whose first entry is the number of features in each row; these are the features the target variable will be predicted from. `cdb_crankshaft.CDB_PyAgg(Array[feature1, feature2, feature3]::numeric[])` can be used to construct this |
| prediction\_ids | numeric[] | A 1D array of length n\_rows with the ids that can be used to re-join the predictions with the input rows |
| n\_estimators (optional) | INTEGER DEFAULT 1200 | Number of boosting stages to use |
| max\_depth (optional) | INTEGER DEFAULT 3 | Maximum tree depth |
| subsample (optional) | DOUBLE PRECISION DEFAULT 0.5 | Fraction of samples used to fit each tree |
| learning\_rate (optional) | DOUBLE PRECISION DEFAULT 0.01 | Learning rate for the boosting |
| min\_samples\_leaf (optional) | INTEGER DEFAULT 1 | Minimum number of samples required at a leaf node |

#### Returns

A table with the following columns.

| Column Name | Type | Description |
|-------------|------|-------------|
| cartodb\_id | INTEGER | The id of the row, taken from prediction\_ids |
| prediction | NUMERIC | The predicted value of the variable of interest |
| accuracy | NUMERIC | The mean squared error of the model on held-out test data |

#### Example Usage

```sql
WITH training As (
    SELECT array_agg(agg) As target,
           cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features
    FROM late_night_agg),
target AS (
    SELECT cdb_crankshaft.CDB_PyAgg(Array[median_rent, male_pop, female_pop]::Numeric[]) As features,
           array_agg(cartodb_id) As cartodb_ids FROM late_night_agg)
SELECT cdb_crankshaft.CDB_CreateAndPredictSegment(training.target, training.features, target.features, target.cartodb_ids)
FROM training, target;
```
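
For reference, the packed format consumed here can be reproduced outside the database; an illustrative sketch with made-up values (the column names mirror the example above and are not part of the API):

```python
import numpy as np

# Three hypothetical feature columns over four rows.
median_rent = [1200.0, 1350.0, 990.0, 1500.0]
male_pop = [510.0, 480.0, 620.0, 300.0]
female_pop = [530.0, 470.0, 640.0, 310.0]

# Equivalent of CDB_PyAgg(Array[median_rent, male_pop, female_pop]::numeric[]):
# the first entry is the number of features, then the rows follow in order.
rows = zip(median_rent, male_pop, female_pop)
packed = [3.0] + [v for row in rows for v in row]

# The receiving side recovers the 2D shape from the first entry.
n_features = int(packed[0])
features = np.array(packed[1:]).reshape(-1, n_features)
print(features.shape)  # (4, 3)
```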
@@ -1,2 +1,3 @@
import random_seeds
import clustering
import segmentation
19
src/pg/sql/04_py_agg.sql
Normal file
@@ -0,0 +1,19 @@
-- Aggregate state-transition function: on the first row, store the row width
-- (number of columns) as the first element of the state array, then append
-- each incoming row to the state.
CREATE OR REPLACE FUNCTION
CDB_PyAggS(current_state Numeric[], current_row Numeric[])
returns NUMERIC[] as $$
BEGIN
    if array_upper(current_state,1) is null then
        RAISE NOTICE 'setting state %', array_upper(current_row,1);
        current_state[1] = array_upper(current_row,1);
    end if;
    return array_cat(current_state, current_row);
END
$$ LANGUAGE plpgsql;

CREATE AGGREGATE CDB_PyAgg(NUMERIC[])(
    SFUNC = CDB_PyAggS,
    STYPE = Numeric[],
    INITCOND = '{}'
);
53
src/pg/sql/05_segmentation.sql
Normal file
@@ -0,0 +1,53 @@
CREATE OR REPLACE FUNCTION
  CDB_CreateAndPredictSegment(
      target NUMERIC[],
      features NUMERIC[],
      target_features NUMERIC[],
      target_ids NUMERIC[],
      n_estimators INTEGER DEFAULT 1200,
      max_depth INTEGER DEFAULT 3,
      subsample DOUBLE PRECISION DEFAULT 0.5,
      learning_rate DOUBLE PRECISION DEFAULT 0.01,
      min_samples_leaf INTEGER DEFAULT 1)
RETURNS TABLE(cartodb_id NUMERIC, prediction NUMERIC, accuracy NUMERIC)
AS $$
    import numpy as np
    import plpy

    from crankshaft.segmentation import create_and_predict_segment_agg
    model_params = {'n_estimators': n_estimators,
                    'max_depth': max_depth,
                    'subsample': subsample,
                    'learning_rate': learning_rate,
                    'min_samples_leaf': min_samples_leaf}

    def unpack2D(data):
        # The first element of the packed array is the row width (see CDB_PyAgg);
        # cast it to int so it can be used as a reshape dimension.
        dimension = int(data.pop(0))
        a = np.array(data, dtype=float)
        return a.reshape(len(a) / dimension, dimension)

    return create_and_predict_segment_agg(np.array(target, dtype=float),
                                          unpack2D(features),
                                          unpack2D(target_features),
                                          target_ids,
                                          model_params)
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION
  CDB_CreateAndPredictSegment(
      query TEXT,
      variable_name TEXT,
      target_table TEXT,
      n_estimators INTEGER DEFAULT 1200,
      max_depth INTEGER DEFAULT 3,
      subsample DOUBLE PRECISION DEFAULT 0.5,
      learning_rate DOUBLE PRECISION DEFAULT 0.01,
      min_samples_leaf INTEGER DEFAULT 1)
RETURNS TABLE (cartodb_id TEXT, prediction NUMERIC, accuracy NUMERIC)
AS $$
    from crankshaft.segmentation import create_and_predict_segment
    model_params = {'n_estimators': n_estimators,
                    'max_depth': max_depth,
                    'subsample': subsample,
                    'learning_rate': learning_rate,
                    'min_samples_leaf': min_samples_leaf}
    return create_and_predict_segment(query, variable_name, target_table, model_params)
$$ LANGUAGE plpythonu;
27
src/pg/test/expected/06_segmentation_test.out
Normal file
@@ -0,0 +1,27 @@
\pset format unaligned
\set ECHO none
_cdb_random_seeds

(1 row)
within_tolerance
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
t
(20 rows)
2005
src/pg/test/fixtures/ml_values.sql
vendored
Normal file
File diff suppressed because it is too large
33
src/pg/test/sql/06_segmentation_test.sql
Normal file
@@ -0,0 +1,33 @@
\pset format unaligned
\set ECHO none
\i test/fixtures/ml_values.sql
SELECT cdb_crankshaft._cdb_random_seeds(1234);

WITH expected AS (
    SELECT generate_series(1000,1020) AS id, unnest(ARRAY[
        4.5656517130822492,
        1.7928053473230694,
        1.0283378773916563,
        2.6586517814904593,
        2.9699056242935944,
        3.9550646059951347,
        4.1662572444459745,
        3.8126334839264162,
        1.8809821053623488,
        1.6349065129019873,
        3.0391288591472954,
        3.3035970359672553,
        1.5835471589451968,
        3.7530378537263638,
        1.0833589653009252,
        3.8104965452882897,
        2.665217959294802,
        1.5850334252802472,
        3.679401198805563,
        3.5332033186588636
    ]) AS expected LIMIT 20
), prediction AS (
    SELECT cartodb_id::integer id, prediction
    FROM cdb_crankshaft.CDB_CreateAndPredictSegment(
        'SELECT target, x1, x2, x3 FROM ml_values WHERE class = $$train$$',
        'target',
        'SELECT cartodb_id, target, x1, x2, x3 FROM ml_values WHERE class = $$test$$')
    LIMIT 20
) SELECT abs(e.expected - p.prediction) <= 1e-9 AS within_tolerance
FROM expected e, prediction p
WHERE e.id = p.id;
@@ -2,3 +2,4 @@
import crankshaft.random_seeds
import crankshaft.clustering
import crankshaft.space_time_dynamics
import crankshaft.segmentation
1
src/py/crankshaft/crankshaft/segmentation/__init__.py
Normal file
@@ -0,0 +1 @@
from segmentation import *
176
src/py/crankshaft/crankshaft/segmentation/segmentation.py
Normal file
@@ -0,0 +1,176 @@
"""
Segmentation creation and prediction
"""

import sklearn
import numpy as np
import plpy
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics
from sklearn.cross_validation import train_test_split

# Lower level functions
# ---------------------

def replace_nan_with_mean(array):
    """
    Input:
        @param array: an array of floats which may have NaN-valued entries
    Output:
        the array with each NaN replaced by the mean of the non-NaN values
        in the same column
    """
    # indices of the NaN entries, as (row, column) index arrays
    indices = np.where(np.isnan(array))

    # fill each NaN entry with the mean of the non-NaN values in its column
    for row, col in zip(*indices):
        array[row, col] = np.mean(array[~np.isnan(array[:, col]), col])

    return array

def get_data(variable, feature_columns, query):
    """
    Fetch data from the database, clean it, and package it into
    numpy arrays
    Input:
        @param variable: name of the target variable
        @param feature_columns: list of column names
        @param query: subquery that data is pulled from for the packaging
    Output:
        prepared data, packaged into NumPy arrays
    """

    columns = ','.join(['array_agg("{col}") As "{col}"'.format(col=col) for col in feature_columns])

    try:
        data = plpy.execute('''SELECT array_agg("{variable}") As target, {columns} FROM ({query}) As a'''.format(
            variable=variable,
            columns=columns,
            query=query))
    except Exception, e:
        plpy.error('Failed to access data to build segmentation model: %s' % e)

    # extract target data from the plpy result
    target = np.array(data[0]['target'])

    # stack the n feature arrays into an n_rows x n_features 2D array
    features = np.column_stack([np.array(data[0][col], dtype=float) for col in feature_columns])

    return replace_nan_with_mean(target), replace_nan_with_mean(features)

# High level interface
# --------------------

def create_and_predict_segment_agg(target, features, target_features, target_ids, model_parameters):
    """
    Version of create_and_predict_segment that works on arrays that come
    straight from the SQL calling the function.

    Input:
        @param target: 1D array of length n_samples containing the target variable the model should predict
        @param features: 2D array of size n_samples * n_features that forms the input to the model
        @param target_features: 2D array of features for the rows to predict
        @param target_ids: 1D array of ids used to associate each prediction with the row it came from
        @param model_parameters: dictionary containing parameters for the model
    """

    clean_target = replace_nan_with_mean(target)
    clean_features = replace_nan_with_mean(features)
    target_features = replace_nan_with_mean(target_features)

    model, accuracy = train_model(clean_target, clean_features, model_parameters, 0.2)
    prediction = model.predict(target_features)
    # every row of the result carries the same overall model accuracy
    accuracy_array = [accuracy] * prediction.shape[0]
    return zip(target_ids, prediction, accuracy_array)


def create_and_predict_segment(query, variable, target_query, model_params):
    """
    Generate a segmentation model and predictions using machine learning
    Stuart Lynn
    """

    # fetch column names
    try:
        columns = plpy.execute('SELECT * FROM ({query}) As a LIMIT 1 '.format(query=query))[0].keys()
    except Exception, e:
        plpy.error('Failed to build segmentation model: %s' % e)

    # extract the column names to be used in building the segmentation model
    feature_columns = set(columns) - set([variable, 'cartodb_id', 'the_geom', 'the_geom_webmercator'])
    # get data from the database
    target, features = get_data(variable, feature_columns, query)

    model, accuracy = train_model(target, features, model_params, 0.2)
    cartodb_ids, result = predict_segment(model, feature_columns, target_query)
    accuracy_array = [accuracy] * result.shape[0]
    return zip(cartodb_ids, result, accuracy_array)


def train_model(target, features, model_params, test_split):
    """
    Train the Gradient Boosting model on the provided data and calculate
    the accuracy of the model
    Input:
        @param target: 1D array of the variable that the model is to be trained to predict
        @param features: 2D array of n_samples * n_features to use in training the model
        @param model_params: dictionary of model parameters, the full specification can be found on the
            scikit-learn page for [GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
        @param test_split: the fraction of the data to be withheld for testing the model / calculating the accuracy
    """
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=test_split)
    model = GradientBoostingRegressor(**model_params)
    model.fit(features_train, target_train)
    # evaluate on the held-out split so the reported accuracy reflects generalization
    accuracy = calculate_model_accuracy(model, features_test, target_test)
    return model, accuracy

def calculate_model_accuracy(model, features, target):
    """
    Calculate the mean squared error of the model prediction
    Input:
        @param model: model trained from input features
        @param features: features to make a prediction from
        @param target: target to compare the prediction to
    Output:
        mean squared error of the model prediction compared to the target
    """
    prediction = model.predict(features)
    return metrics.mean_squared_error(prediction, target)

def predict_segment(model, features, target_query):
    """
    Use the provided model to predict the values for the new feature set
    Input:
        @param model: the pretrained model
        @param features: a list of features to use in the model prediction (list of column names)
        @param target_query: the query to run to obtain the data to predict on and the cartodb_ids associated with it
    """

    batch_size = 1000
    joined_features = ','.join(['"{0}"::numeric'.format(a) for a in features])

    try:
        cursor = plpy.cursor('SELECT Array[{joined_features}] As features FROM ({target_query}) As a'.format(
            joined_features=joined_features,
            target_query=target_query))
    except Exception, e:
        plpy.error('Failed to build segmentation model: %s' % e)

    results = []

    # predict in batches so large target tables need not fit in memory at once
    while True:
        rows = cursor.fetch(batch_size)
        if not rows:
            break
        batch = np.row_stack([np.array(row['features'], dtype=float) for row in rows])

        # TODO: this should use the global mean rather than the batch mean;
        # filling per batch can cause inconsistent imputation across batches
        batch = replace_nan_with_mean(batch)
        prediction = model.predict(batch)
        results.append(prediction)

    try:
        cartodb_ids = plpy.execute('''SELECT array_agg(cartodb_id ORDER BY cartodb_id) As cartodb_ids FROM ({0}) As a'''.format(target_query))[0]['cartodb_ids']
    except Exception, e:
        plpy.error('Failed to build segmentation model: %s' % e)

    return cartodb_ids, np.concatenate(results)
@@ -40,6 +40,7 @@ setup(
    # The choice of component versions is dictated by what's
    # provisioned in the production servers.
    # IMPORTANT NOTE: please don't change this line. Instead issue a ticket to systems for evaluation.
    install_requires=['joblib==0.8.3', 'numpy==1.6.1', 'scipy==0.14.0', 'pysal==1.11.2', 'scikit-learn==0.14.1'],

    requires=['pysal', 'numpy', 'sklearn'],
@@ -1,5 +1,16 @@
import re

class MockCursor:
    def __init__(self, data):
        self.cursor_pos = 0
        self.data = data

    def fetch(self, batch_size):
        # return the next batch_size rows and advance the cursor
        batch = self.data[self.cursor_pos : self.cursor_pos + batch_size]
        self.cursor_pos += batch_size
        return batch


class MockPlPy:
    def __init__(self):
        self._reset()
@@ -30,6 +41,10 @@ class MockPlPy:
    def info(self, msg):
        self.infos.append(msg)

    def cursor(self, query):
        data = self.execute(query)
        return MockCursor(data)

    def execute(self, query):  # TODO: additional arguments
        for result in self.results:
            if result[0].match(query):
64
src/py/crankshaft/test/test_segmentation.py
Normal file
@@ -0,0 +1,64 @@
import unittest
import numpy as np
from helper import plpy, fixture_file
import crankshaft.segmentation as segmentation
import json

class SegmentationTest(unittest.TestCase):
    """Testing class for the segmentation functions"""

    def setUp(self):
        plpy._reset()

    def generate_random_data(self, n_samples, random_state, row_type=False):
        x1 = random_state.uniform(size=n_samples)
        x2 = random_state.uniform(size=n_samples)
        x3 = random_state.randint(0, 4, size=n_samples)

        y = x1 + x2 * x2 + x3
        cartodb_id = range(len(x1))

        if row_type:
            return [{'features': vals} for vals in zip(x1, x2, x3)], y
        else:
            return [dict(zip(['x1', 'x2', 'x3', 'target', 'cartodb_id'], [x1, x2, x3, y, cartodb_id]))]

    def test_replace_nan_with_mean(self):
        # replace_nan_with_mean indexes entries by (row, col), so use a 2D column array
        test_array = np.array([[1.2], [np.nan], [3.2], [np.nan], [np.nan]])
        result = segmentation.replace_nan_with_mean(test_array)
        # NaNs are filled with the column mean, (1.2 + 3.2) / 2 = 2.2
        self.assertTrue(np.array_equal(result, np.array([[1.2], [2.2], [3.2], [2.2], [2.2]])))

    def test_create_and_predict_segment(self):
        n_samples = 1000

        random_state_train = np.random.RandomState(13)
        random_state_test = np.random.RandomState(134)
        training_data = self.generate_random_data(n_samples, random_state_train)
        test_data, test_y = self.generate_random_data(n_samples, random_state_test, row_type=True)

        ids = [{'cartodb_ids': range(len(test_data))}]
        rows = [{'x1': 0, 'x2': 0, 'x3': 0, 'y': 0, 'cartodb_id': 0}]

        plpy._define_result('select \* from \(select \* from training\) a limit 1', rows)
        plpy._define_result('.*from \(select \* from training\) as a', training_data)
        plpy._define_result('select array_agg\(cartodb\_id order by cartodb\_id\) as cartodb_ids from \(.*\) a', ids)
        plpy._define_result('.*select \* from test.*', test_data)

        model_parameters = {'n_estimators': 1200,
                            'max_depth': 3,
                            'subsample': 0.5,
                            'learning_rate': 0.01,
                            'min_samples_leaf': 1}

        result = segmentation.create_and_predict_segment(
            'select * from training',
            'target',
            'select * from test',
            model_parameters)

        prediction = [r[1] for r in result]

        accuracy = np.sqrt(np.mean(np.square(np.array(prediction) - np.array(test_y))))

        self.assertEqual(len(result), len(test_data))
        self.assertTrue(result[0][2] < 0.01)
        self.assertTrue(accuracy < 0.5 * np.mean(test_y))