performance improvements

adding sklearn to deps
fixing syntax
2016-05-27 19:31:37 +00:00 · 2016-05-27 14:59:24 +00:00 · 2016-05-27 14:58:43 +00:00 · 2016-05-27 14:58:05 +00:00 · 2016-05-27 10:33:00 -04:00 · 2016-05-27 10:29:47 -04:00
21 changed files with 552 additions and 919 deletions
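
The rewritten `doc/02_moran.md` in this change describes the HH, LL, HL and LH quadrants only in prose. The following is a minimal sketch of that classification, not the crankshaft or PySAL implementation: it assumes neighbors are given as plain index lists, uses an unweighted mean of neighbor z-scores as the spatial lag, and skips the permutation-based significance entirely. The name `classify_quadrants` is hypothetical.

```python
from statistics import mean, pstdev


def classify_quadrants(values, neighbors):
    """Return a (local_i, quadrant) pair for every observation.

    values:    list of numeric attribute values, one per geometry
    neighbors: list of lists; neighbors[i] holds the indices of the
               geometries treated as neighbors of geometry i
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("values have no variance")

    # Standardize so that "high" and "low" are relative to the mean.
    z = [(v - mu) / sigma for v in values]

    results = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            raise ValueError("observation %d has no neighbors" % i)
        # Spatial lag: average standardized value of the neighbors.
        lag = mean([z[j] for j in nbrs])
        # Simplified local statistic: own z-score times the lag.
        local_i = z[i] * lag
        own = "H" if z[i] >= 0 else "L"
        around = "H" if lag >= 0 else "L"
        results.append((local_i, own + around))
    return results


if __name__ == "__main__":
    vals = [10.0, 9.0, 1.0, 2.0, 10.0]
    nbrs = [[1], [0], [3], [2], [2, 3]]
    for local_i, quad in classify_quadrants(vals, nbrs):
        print(quad, round(local_i, 3))
```

In this toy input the last observation has a high value while both of its neighbors are low, so it lands in the HL quadrant with a negative local statistic, which is the "spatial outlier" case the documentation describes.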
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,7 +0,0 @@
-
- [ ] All declared geometries are `geometry(Geometry, 4326)` for general geoms, or `geometry(Point, 4326)`
- [ ] Include python is activated for new functions. Include this before importing modules: `plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')`
- [ ] Docs for public-facing functions are written
- [ ] New functions follow the naming conventions: `CDB_NameOfFunction`. Where internal functions begin with an underscore `_`.
- [ ] If appropriate, new functions accepts an arbitrary query as an input (see [Crankshaft Issue #6](https://github.com/CartoDB/crankshaft/issues/6) for more information)
- 
--- a/doc/02_moran.md
+++ b/doc/02_moran.md
@@ -1,185 +1,169 @@
-## Areas of Interest Functions
+## Name

-### CDB_AreasOfInterestLocal(subquery text, column_name text)
+CDB_AreasOfInterest -- returns a table with a cluster/outlier classification, the significance of a classification, an autocorrelation statistic (Local Moran's I), and the geometry id for each geometry in the original dataset.

-This function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. The classification happens through an autocorrelation statistic called Local Moran's I.
+## Synopsis

-#### Arguments
+```sql
+table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name)

-| Name | Type | Description |
-|------|------|-------------|
-| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
-| column_name | TEXT | Name of column (e.g., should be `'interesting_value'` instead of `interesting_value` without single quotes) used for the analysis. |
-| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
-| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
-| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
-| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
-| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |
+table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name, int permutations, text geom_column, text id_column, text weight_type, int num_ngbrs)
+```

-#### Returns
+## Description

-A table with the following columns.
+CDB_AreasOfInterest is a table-returning function that classifies the geometries in a table by an attribute and gives a significance for that classification. This information can be used to find "Areas of Interest" by using the correlation of a geometry's attribute with that of its neighbors. Areas can be clusters, outliers, or neither (depending on which significance value is used).

-| Column Name | Type | Description |
-|-------------|------|-------------|
-| moran | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
-| quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
-| significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
-| rowid | INT | Row id of the values which correspond to the input rows. |
-| vals | NUMERIC | Values from `'column_name'`. |
+Inputs:

+* `query` (required): an arbitrary query against tables you have access to (e.g., in your account, shared in your organization, or through the Data Observatory). The query must return the following columns: an id of type `INT` (e.g., `cartodb_id`), a geometry (e.g., `the_geom`), and the numeric attribute specified in `column_name`.
+* `column_name` (required): the column on which to perform the areas of interest analysis. The data must be numeric (e.g., `float`, `int`, etc.).
+* `permutations` (optional): used to calculate the significance of a classification. Defaults to 99, which is sufficient in most situations.
+* `geom_column` (optional): the name of the geometry column. Data must be of type `geometry`.
+* `id_column` (optional): the name of the id column (e.g., `cartodb_id`). Data must be of type `int` or `bigint` and have a unique constraint.
+* `weight_type` (optional): the type of weight used for determining what defines a neighborhood. Options are `knn` or `queen`.
+* `num_ngbrs` (optional): the number of neighbors in a neighborhood around a geometry. Only used if `knn` is chosen above.

-#### Example Usage
+Outputs:
+
+* `moran_val`: underlying correlation statistic used in analysis
+* `quadrant`: human-readable interpretation of classification
+* `significance`: significance of classification (closer to 0 is more significant)
+* `ids`: id of original geometry (used for joining against original table if desired -- see examples)
+* `column_values`: original column values from `column_name`
+
+Availability: crankshaft v0.0.1 and above
+
+## Examples

 ```sql
 SELECT
-  c.the_geom,
-  aoi.quads,
+  t.the_geom_webmercator,
+  t.cartodb_id,
  aoi.significance,
-  c.num_cyclists_per_total_population
-FROM CDB_GetAreasOfInterestLocal('SELECT * FROM commute_data'
-                                 'num_cyclists_per_total_population') As aoi
-JOIN commute_data As c
-ON c.cartodb_id = aoi.rowid;
+  aoi.quadrant As aoi_quadrant
+FROM
+  observatory.acs2013 As t
+JOIN
+  crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013',
+                                 'gini_index') As aoi ON t.cartodb_id = aoi.ids
 ```

-### CDB_AreasOfInterestGlobal(subquery text, column_name text)
+## API Usage

-This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
+Example

-#### Arguments
-
-| Name | Type | Description |
-|------|------|-------------|
-| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
-| column_name | TEXT | Name of column (e.g., should be `'interesting_value'` instead of `interesting_value` without single quotes) used for the analysis. |
-| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
-| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
-| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
-| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
-| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |
-
-#### Returns
-
-A table with the following columns.
-
-| Column Name | Type | Description |
-|-------------|------|-------------|
-| moran | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the entire dataset. Values closer to one indicate cluster, closer to -1 mean more outliers, and near zero indicates a random distribution of data. |
-| significance | NUMERIC | The statistical significance of the `moran` measure. |
-
-#### Examples
-
-```sql
-SELECT *
-FROM CDB_AreasOfInterestGlobal('SELECT * FROM commute_data', 'num_cyclists_per_total_population')
+```text
+http://eschbacher.cartodb.com/api/v2/sql?q=SELECT * FROM crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013','gini_index')
 ```

-### CDB_AreasOfInterestLocalRate(subquery text, numerator_column text, denominator_column text)
-
-Just like `CDB_AreasOfInterestLocal`, this function classifies your data as being part of a cluster, as an outlier, or not part of a pattern based the significance of a classification. This function differs in that it calculates the classifications based on input `numerator` and `denominator` columns for finding the areas where there are clusters and outliers for the resulting rate of those two values.
-
-#### Arguments
-
-| Name | Type | Description |
-|------|------|-------------|
-| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
-| numerator | TEXT | Name of the numerator for forming a rate to be used in analysis. |
-| denominator | TEXT | Name of the denominator for forming a rate to be used in analysis. |
-| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
-| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
-| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
-| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
-| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |
-
-#### Returns
-
-A table with the following columns.
-
-| Column Name | Type | Description |
-|-------------|------|-------------|
-| moran | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the geometry with id of `rowid` |
-| quads | TEXT | Classification of geometry. Result is one of 'HH' (a high value with neighbors high on average), 'LL' (opposite of 'HH'), 'HL' (a high value surrounded by lows on average), and 'LH' (opposite of 'HL'). Null values are returned when nulls exist in the original data. |
-| significance | NUMERIC | The statistical significance (from 0 to 1) of a cluster or outlier classification. Lower numbers are more significant. |
-| rowid | INT | Row id of the values which correspond to the input rows. |
-| vals | NUMERIC | Values from `'column_name'`. |
-
-
-#### Example Usage
-
-```sql
-SELECT
-  c.the_geom,
-  aoi.quads,
-  aoi.significance,
-  c.cyclists_per_total_population
-FROM CDB_GetAreasOfInterestLocalRate('SELECT * FROM commute_data'
-                                     'num_cyclists',
-                                     'total_population') As aoi
-JOIN commute_data As c
-ON c.cartodb_id = aoi.rowid;
+Result
+```json
+{
+  "time": 0.120,
+  "total_rows": 100,
+  "rows": [{
+    "moran_val": 0.7213,
+    "quadrant": "High area",
+    "significance": 0.03,
+    "ids": 1,
+    "column_values": 0.22
+  },
+  {
+    "moran_val": -0.7213,
+    "quadrant": "Low outlier",
+    "significance": 0.13,
+    "ids": 2,
+    "column_values": 0.03
+  },
+  ...
+  ]
+}
 ```

-### CDB_AreasOfInterestGlobalRate(subquery text, column_name text)
+## See Also

-This function identifies the extent to which geometries cluster (the groupings of geometries with similarly high or low values relative to the mean) or form outliers (areas where geometries have values opposite of their neighbors). The output of this function gives values between -1 and 1 as well as a significance of that classification. Values close to 0 mean that there is little to no distribution of values as compared to what one would see in a randomly distributed collection of geometries and values.
+crankshaft's areas of interest functions:

-#### Arguments
+* [CDB_AreasOfInterest_Global]()
+* [CDB_AreasOfInterest_Rate_Local]()
+* [CDB_AreasOfInterest_Rate_Global]()

-| Name | Type | Description |
-|------|------|-------------|
-| subquery | TEXT | SQL query that exposes the data to be analyzed (e.g., `SELECT * FROM interesting_table`). This query must have the geometry column name `the_geom` and id column name `cartodb_id` unless otherwise specified in the input arguments |
-| numerator | TEXT | Name of the numerator for forming a rate to be used in analysis. |
-| denominator | TEXT | Name of the denominator for forming a rate to be used in analysis. |
-| weight type (optional) | TEXT | Type of weight to use when finding neighbors. Currently available options are 'knn' (default) and 'queen'. Read more about weight types in [PySAL's weights documentation](https://pysal.readthedocs.io/en/v1.11.0/users/tutorials/weights.html). |
-| num_ngbrs (optional) | INT | Number of neighbors if using k-nearest neighbors weight type. Defaults to 5. |
-| permutations (optional) | INT | Number of permutations to check against a random arrangement of the values in `column_name`. This influences the accuracy of the output field `significance`. Defaults to 99. |
-| geom_col (optional) | TEXT | The column name for the geometries. Defaults to `'the_geom'` |
-| id_col (optional) | TEXT | The column name for the unique ID of each geometry/value pair. Defaults to `'cartodb_id'`. |

-#### Returns
+PostGIS clustering functions:

-A table with the following columns.
+* [ST_ClusterIntersecting](http://postgis.net/docs/manual-2.2/ST_ClusterIntersecting.html)
+* [ST_ClusterWithin](http://postgis.net/docs/manual-2.2/ST_ClusterWithin.html)

-| Column Name | Type | Description |
-|-------------|------|-------------|
-| moran | NUMERIC | Value of Moran's I (spatial autocorrelation measure) for the entire dataset. Values closer to one indicate cluster, closer to -1 mean more outliers, and near zero indicates a random distribution of data. |
-| significance | NUMERIC | The statistical significance of the `moran` measure. |

-#### Examples
+-- removing below, working into above

-```sql
-SELECT *
-FROM CDB_AreasOfInterestGlobalRate('SELECT * FROM commute_data',          
-                                   'num_cyclists',
-                                   'total_population')
-```
+#### What is Moran's I and why is it significant for CartoDB?

-## Hotspot, Coldspot, and Outlier Functions
+Moran's I is a geostatistical calculation which gives a measure of the global
+clustering and presence of outliers within the geographies in a map. Here global
+means over all of the geographies in a dataset. Imagine mapping the incidence
+rates of cancer in neighborhoods of a city. If there were areas covering several
+neighborhoods with abnormally low rates of cancer, those areas are positively
+spatially correlated with one another and would be considered a cluster. If
+there was a single neighborhood with a high rate but with all neighbors on
+average having a low rate, it would be considered a spatial outlier.

-These functions are convenience functions for extracting only information that you are interested in exposing based on the outputs of the `CDB_AreasOfInterest` functions. For instance, you can use `CDB_GetSpatialHotspots` to output only the classifications of `HH` and `HL`.
+While Moran's I gives a global snapshot, there are local indicators for
+clustering called Local Indicators of Spatial Autocorrelation. Clustering is a
+process related to autocorrelation -- i.e., a process that compares a
+geography's attribute to the attribute in neighbor geographies.

-### Non-rate functions
+In the cancer-rate example, neighborhoods that have a high rate of cancer and
+whose neighbors also have high rates are designated "High High" or simply
+**HH**. Areas where multiple neighboring neighborhoods have low rates of cancer
+are designated "Low Low" or **LL**. HH and LL naturally fit the concept of
+clustering: in both cases a geography's value is positively correlated with
+those of its neighbors.

-#### CDB_GetSpatialHotspots
-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocal` except that the outputs are filtered to be only 'HH' and 'HL' (areas of high values). For more information about this function's use, see `CDB_AreasOfInterestLocal`.
+"Anticorrelated" geographies are in **LH** and **HL** regions -- that is,
+regions where a geography has a high value and its neighbors, on average, have
+a low value (or vice versa). An example of this is a "gated community" or the
+placement of a city housing project in a rich region. These deliberate
+developments have a median income opposite to that of the neighbors around
+them: a high (or low) value while their neighbors have a low (or high) value.
+They typically exist as islands, and in rare circumstances can extend as chains
+dividing **LL** or **HH** regions.

-#### CDB_GetSpatialColdspots
-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocal` except that the outputs are filtered to be only 'LL' and 'LH' (areas of low values). For more information about this function's use, see `CDB_AreasOfInterestLocal`.
+Strong policies such as rent stabilization (probably) tend to prevent the
+clustering of high rent areas as they integrate middle class incomes. Luxury
+apartment buildings, which are a kind of gated community, probably tend to skew
+an area's median income upwards while housing projects have the opposite effect.
+What are the nuggets in the analysis?

-#### CDB_GetSpatialOutliers
-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocal` except that the outputs are filtered to be only 'HL' and 'LH' (areas where highs or lows are surrounded by opposite values on average). For more information about this function's use, see `CDB_AreasOfInterestLocal`.
+Two functions are available to compute Moran's I statistics:

-### Rate functions
+* `cdb_moran_local` computes Moran's I measures, quad classification and
+  significance values from numerical values associated with geometry entities
+  in an input table. The geometries should be contiguous polygons when
+  the `queen` `w_type` is used.
+* `cdb_moran_local_rate` computes the same statistics using a ratio between
+  numerator and denominator columns of a table.

-#### CDB_GetSpatialHotspotsRate
+The parameters for `cdb_moran_local` are:

-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocalRate` except that the outputs are filtered to be only 'HH' and 'HL' (areas of high values). For more information about this function's use, see `CDB_AreasOfInterestLocalRate`.
+* `table` name of the table that contains the data values
+* `attr` name of the column
+* `significance` significance threshold for the quads values
+* `num_ngbrs` number of neighbors to consider (default: 5)
+* `permutations` number of random permutations for calculation of
+  pseudo-p values (default: 99)
+* `geom_column` name of the geometry column (default: "the_geom")
+* `id_col` PK column of the table (default: "cartodb_id")
+* `w_type` Weight types: can be "knn" for k-nearest neighbor weights
+  or "queen" for contiguity based weights.

-#### CDB_GetSpatialColdspotsRate
+The function returns a table with the following columns:

-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocalRate` except that the outputs are filtered to be only 'LL' and 'LH' (areas of low values). For more information about this function's use, see `CDB_AreasOfInterestLocalRate`.
+* `moran` Moran's value
+* `quads` quad classification ('HH', 'LL', 'HL', 'LH' or 'Not significant')
+* `significance` significance value
+* `ids` id of the corresponding record in the input table

-#### CDB_GetSpatialOutliersRate
-
-This function's inputs and outputs exactly mirror `CDB_AreasOfInterestLocalRate` except that the outputs are filtered to be only 'HL' and 'LH' (areas where highs or lows are surrounded by opposite values on average). For more information about this function's use, see `CDB_AreasOfInterestLocalRate`.
+Function `cdb_moran_local_rate` only differs in that the `attr` input
+parameter is substituted by `numerator` and `denominator`.
--- a/doc/07_gravity.md
+++ b/doc/07_gravity.md
@@ -1,78 +0,0 @@
-## Gravity Model
-
-Gravity Models are derived from Newton's Law of Gravity and are used to predict the interaction between a group of populated areas (sources) and a specific target among a group of potential targets, in terms of an attraction factor (weight)
-
-**CDB_Gravity** is based on the model defined in *Huff's Law of Shopper attraction (1963)*
-
-### CDB_Gravity(t_id bigint[], t_geom geometry[], t_weight numeric[], s_id bigint[], s_geom geometry[], s_pop numeric[], target bigint, radius integer, minval numeric DEFAULT -10e307)
-
-#### Arguments
-
-| Name | Type | Description |
-|------|------|-------------|
-| t_id     | bigint[]    | Array of targets ID |
-| t_geom   | geometry[]  | Array of targets' geometries |
-| t_weight | numeric[]   | Array of targets's weights |
-| s_id     | bigint[]    | Array of sources ID |
-| s_geom   | geometry[]  | Array of sources' geometries |
-| s_pop    | numeric[]   | Array of sources's population |
-| target   | bigint      | ID of the target under study |
-| radius   | integer     | Radius in meters around the target under study that will be taken into account|
-| minval (optional)   | numeric     | Lowest accepted value of weight, defaults to numeric min_value |
-
-### CDB_Gravity( target_query text, weight_column text, source_query text, pop_column text, target bigint, radius integer, minval numeric DEFAULT -10e307)
-
-#### Arguments
-
-| Name | Type | Description |
-|------|------|-------------|
-| target_query     | text    | Query that defines targets |
-| weight_column   | text  | Column name of weights |
-| source_query     | text    | Query that defines sources |
-| pop_column   | text  | Column name of population |
-| target   | bigint      | cartodb_id of the target under study |
-| radius   | integer     | Radius in meters around the target under study that will be taken into account|
-| minval (optional)   | numeric     | Lowest accepted value of weight, defaults to numeric min_value |
-
-
-### Returns
-
-| Column Name | Type | Description |
-|-------------|------|-------------|
-| the_geom  | geometry | Geometries of the sources within the radius |
-| source_id | bigint  | ID of the source |
-| target_id | bigint  | Target ID from input |
-| dist      | numeric | Distance in meters source to target (if not points, distance between centroids) |
-| h         | numeric | Probability of patronage |
-| hpop      | numeric | Patronaging population |
-
-
-#### Example Usage
-
-```sql
-with t as (
-SELECT
-    array_agg(cartodb_id::bigint) as id,
-    array_agg(the_geom) as g,
-    array_agg(coalesce(gla,0)::numeric) as w
-FROM
-    abel.centros_comerciales_de_madrid
-WHERE not no_cc
-),
-s as (
-SELECT
-    array_agg(cartodb_id::bigint) as id,
-    array_agg(center) as g,
-    array_agg(coalesce(t1_1, 0)::numeric) as p
-FROM
-    sscc_madrid
-)
-select
-    g.the_geom,
-    trunc(g.h,2) as h,
-    round(g.hpop) as hpop,
-    trunc(g.dist/1000,2) as dist_km
-FROM t, s, CDB_Gravity1(t.id, t.g, t.w, s.id, s.g, s.p, newmall_ID, 100000, 5000) g
-```
-
-
--- a/src/pg/sql/07_gravity.sql
+++ b/src/pg/sql/07_gravity.sql
@@ -1,115 +0,0 @@
-CREATE OR REPLACE FUNCTION CDB_Gravity(
-    IN target_query text,
-    IN weight_column text,
-    IN source_query text,
-    IN pop_column text,
-    IN target bigint,
-    IN radius integer,
-    IN minval numeric DEFAULT -10e307
-    )
-RETURNS TABLE(
-    the_geom geometry,
-    source_id bigint,
-    target_id bigint,
-    dist numeric,
-    h numeric,
-    hpop numeric)  AS $$
-DECLARE
-    t_id bigint[];
-    t_geom geometry[];
-    t_weight numeric[];
-    s_id bigint[];
-    s_geom geometry[];
-    s_pop numeric[];
-BEGIN
-    EXECUTE 'WITH foo as('+target_query+') SELECT array_agg(cartodb_id), array_agg(the_geom), array_agg(' || weight_column || ') FROM foo' INTO t_id, t_geom, t_weight;
-    EXECUTE 'WITH foo as('+source_query+') SELECT array_agg(cartodb_id), array_agg(the_geom), array_agg(' || pop_column || ') FROM foo' INTO s_id, s_geom, s_pop;
-    RETURN QUERY
-    SELECT g.* FROM t, s, CDB_Gravity(t_id, t_geom, t_weight, s_id, s_geom, s_pop, target, radius, minval) g;
-END;
-$$ language plpgsql;
-
-CREATE OR REPLACE FUNCTION CDB_Gravity(
-    IN t_id bigint[],
-    IN t_geom geometry[],
-    IN t_weight numeric[],
-    IN s_id bigint[],
-    IN s_geom geometry[],
-    IN s_pop numeric[],
-    IN target bigint,
-    IN radius integer,
-    IN minval numeric DEFAULT -10e307
-    )
-RETURNS TABLE(
-    the_geom geometry,
-    source_id bigint,
-    target_id bigint,
-    dist numeric,
-    h numeric,
-    hpop numeric)  AS $$
-DECLARE
-    t_type text;
-    s_type text;
-    t_center geometry[];
-    s_center geometry[];
-BEGIN
-    t_type := GeometryType(t_geom[1]);
-    s_type := GeometryType(s_geom[1]);
-    IF t_type = 'POINT' THEN
-        t_center := t_geom;
-    ELSE
-        WITH tmp as (SELECT unnest(t_geom) as g) SELECT array_agg(ST_Centroid(g)) INTO t_center FROM tmp;
-    END IF;
-    IF s_type = 'POINT' THEN
-        s_center := s_geom;
-    ELSE
-        WITH tmp as (SELECT unnest(s_geom) as g) SELECT array_agg(ST_Centroid(g)) INTO s_center FROM tmp;
-    END IF;
-    RETURN QUERY
-        with target0 as(
-            SELECT unnest(t_center) as tc, unnest(t_weight) as tw, unnest(t_id) as td
-        ),
-        source0 as(
-            SELECT unnest(s_center) as sc, unnest(s_id) as sd, unnest (s_geom) as sg, unnest(s_pop) as sp
-        ),
-        prev0 as(
-            SELECT
-                source0.sg,
-                source0.sd as sourc_id,
-                coalesce(source0.sp,0) as sp,
-                target.td as targ_id,
-                coalesce(target.tw,0) as tw,
-                GREATEST(1.0,ST_Distance(geography(target.tc), geography(source0.sc)))::numeric as distance
-            FROM source0
-            CROSS JOIN LATERAL
-                (
-                SELECT
-                    *
-                FROM target0
-                    WHERE tw > minval
-                    AND ST_DWithin(geography(source0.sc), geography(tc), radius)
-                ) AS target
-        ),
-        deno as(
-            SELECT
-                sourc_id,
-                sum(tw/distance) as h_deno
-            FROM
-                prev0
-            GROUP BY sourc_id
-        )
-        SELECT
-            p.sg as the_geom,
-            p.sourc_id as source_id,
-            p.targ_id as target_id,
-            case when p.distance > 1 then p.distance else 0.0 end as dist,
-            100*(p.tw/p.distance)/d.h_deno as h,
-            p.sp*(p.tw/p.distance)/d.h_deno as hpop
-        FROM
-            prev0 p,
-            deno d
-        WHERE
-            p.targ_id = target AND
-            p.sourc_id = d.sourc_id;
-END;
-$$ language plpgsql;
--- a/src/pg/sql/10_moran.sql
+++ b/src/pg/sql/10_moran.sql
@@ -1,233 +1,89 @@
-- Moran's I Global Measure (public-facing)
+-- Moran's I (global)
 CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterestGlobal(
+  CDB_AreasOfInterest_Global (
      subquery TEXT,
-      column_name TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
+      attr_name TEXT,
      permutations INT DEFAULT 99,
      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
+      id_col TEXT DEFAULT 'cartodb_id',
+      w_type TEXT DEFAULT 'knn',
+      num_ngbrs INT DEFAULT 5)
 RETURNS TABLE (moran NUMERIC, significance NUMERIC)
 AS $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.clustering import moran_local
  # TODO: use named parameters or a dictionary
-  return moran(subquery, column_name, w_type, num_ngbrs, permutations, geom_col, id_col)
+  return moran(subquery, attr_name, num_ngbrs, permutations, geom_col, id_col, w_type)
 $$ LANGUAGE plpythonu;

-- Moran's I Local (internal function)
+-- Moran's I Local
 CREATE OR REPLACE FUNCTION
-  _CDB_AreasOfInterestLocal(
+  CDB_AreasOfInterest_Local(
      subquery TEXT,
-      column_name TEXT,
-      w_type TEXT,
-      num_ngbrs INT,
-      permutations INT,
-      geom_col TEXT,
-      id_col TEXT)
-RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
+      attr TEXT,
+      permutations INT DEFAULT 99,
+      geom_col TEXT DEFAULT 'the_geom',
+      id_col TEXT DEFAULT 'cartodb_id',
+      w_type TEXT DEFAULT 'knn',
+      num_ngbrs INT DEFAULT 5)
+RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, ids INT, y NUMERIC)
 AS $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.clustering import moran_local
  # TODO: use named parameters or a dictionary
-  return moran_local(subquery, column_name, w_type, num_ngbrs, permutations, geom_col, id_col)
+  return moran_local(subquery, attr, permutations, geom_col, id_col, w_type, num_ngbrs)
 $$ LANGUAGE plpythonu;

-- Moran's I Local (public-facing function)
+-- Moran's I Rate (global)
 CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterestLocal(
-    subquery TEXT,
-    column_name TEXT,
-    w_type TEXT DEFAULT 'knn',
-    num_ngbrs INT DEFAULT 5,
-    permutations INT DEFAULT 99,
-    geom_col TEXT DEFAULT 'the_geom',
-    id_col TEXT DEFAULT 'cartodb_id')
-RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocal(subquery, column_name, w_type, num_ngbrs, permutations, geom_col, id_col);
-
-$$ LANGUAGE SQL;
-
-- Moran's I only for HH and HL (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialHotspots(
-    subquery TEXT,
-    column_name TEXT,
-    w_type TEXT DEFAULT 'knn',
-    num_ngbrs INT DEFAULT 5,
-    permutations INT DEFAULT 99,
-    geom_col TEXT DEFAULT 'the_geom',
-    id_col TEXT DEFAULT 'cartodb_id')
-    RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocal(subquery, column_name, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('HH', 'HL');
-
-$$ LANGUAGE SQL;
-
-- Moran's I only for LL and LH (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialColdspots(
-    subquery TEXT,
-    attr TEXT,
-    w_type TEXT DEFAULT 'knn',
-    num_ngbrs INT DEFAULT 5,
-    permutations INT DEFAULT 99,
-    geom_col TEXT DEFAULT 'the_geom',
-    id_col TEXT DEFAULT 'cartodb_id')
-    RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocal(subquery, attr, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('LL', 'LH');
-
-$$ LANGUAGE SQL;
-
-- Moran's I only for LH and HL (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialOutliers(
-    subquery TEXT,
-    attr TEXT,
-    w_type TEXT DEFAULT 'knn',
-    num_ngbrs INT DEFAULT 5,
-    permutations INT DEFAULT 99,
-    geom_col TEXT DEFAULT 'the_geom',
-    id_col TEXT DEFAULT 'cartodb_id')
-    RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocal(subquery, attr, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('HL', 'LH');
-
-$$ LANGUAGE SQL;
-
-- Moran's I Global Rate (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterestGlobalRate(
+  CDB_AreasOfInterest_Global_Rate(
      subquery TEXT,
      numerator TEXT,
      denominator TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
      permutations INT DEFAULT 99,
      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
+      id_col TEXT DEFAULT 'cartodb_id',
+      w_type TEXT DEFAULT 'knn',
+      num_ngbrs INT DEFAULT 5)
 RETURNS TABLE (moran FLOAT, significance FLOAT)
 AS $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
-  from crankshaft.clustering import moran_local
+  from crankshaft.clustering import moran_rate
  # TODO: use named parameters or a dictionary
-  return moran_rate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
+  return moran_rate(subquery, numerator, denominator, permutations, geom_col, id_col, w_type, num_ngbrs)
 $$ LANGUAGE plpythonu;


-- Moran's I Local Rate (internal function)
+-- Moran's I Local Rate
 CREATE OR REPLACE FUNCTION
-  _CDB_AreasOfInterestLocalRate(
+  CDB_AreasOfInterest_Local_Rate(
      subquery TEXT,
      numerator TEXT,
      denominator TEXT,
-      w_type TEXT,
-      num_ngbrs INT,
-      permutations INT,
-      geom_col TEXT,
-      id_col TEXT)
+      permutations INT DEFAULT 99,
+      geom_col TEXT DEFAULT 'the_geom',
+      id_col TEXT DEFAULT 'cartodb_id',
+      w_type TEXT DEFAULT 'knn',
+      num_ngbrs INT DEFAULT 5)
 RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
+TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, ids INT, y NUMERIC)
 AS $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.clustering import moran_local_rate
  # TODO: use named parameters or a dictionary
-  return moran_local_rate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
+  return moran_local_rate(subquery, numerator, denominator, permutations, geom_col, id_col, w_type, num_ngbrs)
 $$ LANGUAGE plpythonu;

-- Moran's I Local Rate (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterestLocalRate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
-RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocalRate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col);
-
-$$ LANGUAGE SQL;
-
-- Moran's I Local Rate only for HH and HL (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialHotspotsRate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
-RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocalRate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('HH', 'HL');
-
-$$ LANGUAGE SQL;
-
-- Moran's I Local Rate only for LL and LH (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialColdspotsRate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
-RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocalRate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('LL', 'LH');
-
-$$ LANGUAGE SQL;
-
-- Moran's I Local Rate only for LH and HL (public-facing function)
-CREATE OR REPLACE FUNCTION
-  CDB_GetSpatialOutliersRate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id')
-RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, rowid INT, vals NUMERIC)
-AS $$
-
-  SELECT moran, quads, significance, rowid, vals
-  FROM cdb_crankshaft._CDB_AreasOfInterestLocalRate(subquery, numerator, denominator, w_type, num_ngbrs, permutations, geom_col, id_col)
-  WHERE quads IN ('HL', 'LH');
-
-$$ LANGUAGE SQL;
+-- -- Moran's I Local Bivariate
+-- CREATE OR REPLACE FUNCTION
+--   cdb_moran_local_bv(
+--       subquery TEXT,
+--       attr1 TEXT,
+--       attr2 TEXT,
+--       permutations INT DEFAULT 99,
+--       geom_col TEXT DEFAULT 'the_geom',
+--       id_col TEXT DEFAULT 'cartodb_id',
+--       w_type TEXT DEFAULT 'knn',
+--       num_ngbrs INT DEFAULT 5)
+-- RETURNS TABLE(moran FLOAT, quads TEXT, significance FLOAT, ids INT, y numeric)
+-- AS $$
+--   from crankshaft.clustering import moran_local_bv
+--   # TODO: use named parameters or a dictionary
+--   return moran_local_bv(subquery, attr1, attr2, permutations, geom_col, id_col, w_type, num_ngbrs)
+-- $$ LANGUAGE plpythonu;
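
Review note: the `# TODO: use named parameters or a dictionary` comments matter more now that this change moves `w_type` and `num_ngbrs` to the end of every signature. A positional call written against the old order still runs, but it silently binds the wrong values. A minimal sketch of the keyword-argument approach follows; `moran_local_stub` is a hypothetical stand-in that only echoes its arguments and is not the real `crankshaft.clustering.moran_local`.

```python
def moran_local_stub(subquery, attr, permutations, geom_col, id_col,
                     w_type, num_ngbrs):
    """Hypothetical stand-in: return the bound arguments as a dict."""
    return {
        'subquery': subquery, 'attr': attr, 'permutations': permutations,
        'geom_col': geom_col, 'id_col': id_col,
        'w_type': w_type, 'num_ngbrs': num_ngbrs,
    }


# Arguments laid out in the OLD positional order
# (w_type and num_ngbrs came before permutations).
old_order = ('SELECT * FROM ppoints', 'value', 'knn', 5, 99,
             'the_geom', 'cartodb_id')
misbound = moran_local_stub(*old_order)

# Keyword arguments bind by name, so the signature order no longer matters.
params = {
    'subquery': 'SELECT * FROM ppoints',
    'attr': 'value',
    'w_type': 'knn',
    'num_ngbrs': 5,
    'permutations': 99,
    'geom_col': 'the_geom',
    'id_col': 'cartodb_id',
}
bound = moran_local_stub(**params)

print(misbound['permutations'])  # the old-order call put 'knn' here
print(bound['permutations'])     # the keyword call keeps 99 here
```

Passing a dictionary this way would let the SQL wrappers and the Python functions reorder their parameters independently.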
--- a/src/pg/sql/80_similarity_rank.sql
+++ b/src/pg/sql/80_similarity_rank.sql
@@ -0,0 +1,15 @@
+CREATE OR REPLACE FUNCTION cdb_SimilarityRank(cartodb_id numeric, query text)
+returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
+as $$
+  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
+  from crankshaft.similarity import similarity_rank
+  return similarity_rank(cartodb_id, query)
+$$ LANGUAGE plpythonu;
+
+CREATE OR REPLACE FUNCTION cdb_MostSimilar(cartodb_id numeric, query text, matches numeric)
+returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
+as $$
+  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
+  from crankshaft.similarity import most_similar
+  return most_similar(matches, query)
+$$ LANGUAGE plpythonu;
--- a/src/pg/test/expected/02_moran_test.out
+++ b/src/pg/test/expected/02_moran_test.out
@@ -1,277 +1,248 @@
-\pset format unaligned
-\set ECHO all
 \i test/fixtures/ppoints.sql
-SET client_min_messages TO WARNING;
-\set ECHO none
-_cdb_random_seeds
-
+-- test table (spanish province centroids with some invented values)
+CREATE TABLE ppoints (cartodb_id integer, the_geom geometry, the_geom_webmercator geometry, code text, region_code text, value float);
+INSERT INTO ppoints VALUES
+( 1,'0101000020E6100000A8306DC0CBC305C051D14B6CE56A4540'::geometry,ST_Transform('0101000020E6100000A8306DC0CBC305C051D14B6CE56A4540'::geometry, 3857),'01','16',0.5),
+( 4,'0101000020E6100000E220A4362DC202C0FD8AFA5119994240'::geometry,ST_Transform('0101000020E6100000E220A4362DC202C0FD8AFA5119994240'::geometry, 3857),'04','01',0.1),
+( 5,'0101000020E610000004377E573AC813C0CB5871BB17494440'::geometry,ST_Transform('0101000020E610000004377E573AC813C0CB5871BB17494440'::geometry, 3857),'05','07',0.3),
+( 2,'0101000020E610000000F49BE19BAFFFBF639958FDA6694340'::geometry,ST_Transform('0101000020E610000000F49BE19BAFFFBF639958FDA6694340'::geometry, 3857),'02','08',0.7),
+( 3,'0101000020E61000005D0B7E63C832E2BFDB63EB00443D4340'::geometry,ST_Transform('0101000020E61000005D0B7E63C832E2BFDB63EB00443D4340'::geometry, 3857),'03','10',0.2),
+( 6,'0101000020E61000006F3742B7FB9018C0DD967DC4D95A4340'::geometry,ST_Transform('0101000020E61000006F3742B7FB9018C0DD967DC4D95A4340'::geometry, 3857),'06','11',0.05),
+( 7,'0101000020E6100000E4BB36995F4C0740EAC0E5CA9FC94340'::geometry,ST_Transform('0101000020E6100000E4BB36995F4C0740EAC0E5CA9FC94340'::geometry, 3857),'07','04',0.4),
+( 8,'0101000020E61000003D43CC6CAFBEFF3F6B52E66F91DD4440'::geometry,ST_Transform('0101000020E61000003D43CC6CAFBEFF3F6B52E66F91DD4440'::geometry, 3857),'08','09',0.7),
+( 9,'0101000020E61000003CC797BD99AF0CC0495A87FA312F4540'::geometry,ST_Transform('0101000020E61000003CC797BD99AF0CC0495A87FA312F4540'::geometry, 3857),'09','07',0.5),
+(13,'0101000020E61000001CAA00A9F19F0EC05DF9267B7A764340'::geometry,ST_Transform('0101000020E61000001CAA00A9F19F0EC05DF9267B7A764340'::geometry, 3857),'13','08',0.4),
+(16,'0101000020E6100000D8208F3CBC9001C065638DC1B1F24340'::geometry,ST_Transform('0101000020E6100000D8208F3CBC9001C065638DC1B1F24340'::geometry, 3857),'16','08',0.4),
+(17,'0101000020E6100000E9E6A94A71630540AD7A0CB062104540'::geometry,ST_Transform('0101000020E6100000E9E6A94A71630540AD7A0CB062104540'::geometry, 3857),'17','09',0.6),
+(18,'0101000020E6100000719792D59E240AC098AC548E00A84240'::geometry,ST_Transform('0101000020E6100000719792D59E240AC098AC548E00A84240'::geometry, 3857),'18','01',0.3),
+(19,'0101000020E6100000972C878B50FD04C0123C881D1F684440'::geometry,ST_Transform('0101000020E6100000972C878B50FD04C0123C881D1F684440'::geometry, 3857),'19','08',0.7),
+(21,'0101000020E6100000F7893E9934511BC0EAA4BF03E1C94240'::geometry,ST_Transform('0101000020E6100000F7893E9934511BC0EAA4BF03E1C94240'::geometry, 3857),'21','01',0.1),
+(22,'0101000020E6100000572C2123B2A8B2BF7ED7FABAFD194540'::geometry,ST_Transform('0101000020E6100000572C2123B2A8B2BF7ED7FABAFD194540'::geometry, 3857),'22','02',0.4),
+(25,'0101000020E6100000461B67D688C4F03FD990EEC3A0054540'::geometry,ST_Transform('0101000020E6100000461B67D688C4F03FD990EEC3A0054540'::geometry, 3857),'25','09',0.4),
+(26,'0101000020E6100000A139FB06E82204C0539D84F62E234540'::geometry,ST_Transform('0101000020E6100000A139FB06E82204C0539D84F62E234540'::geometry, 3857),'26','17',0.6),
+(27,'0101000020E6100000A92E54E618C91DC00D3A947B81814540'::geometry,ST_Transform('0101000020E6100000A92E54E618C91DC00D3A947B81814540'::geometry, 3857),'27','12',0.3),
+(28,'0101000020E6100000971DC8B682BC0DC016D0E8055F3F4440'::geometry,ST_Transform('0101000020E6100000971DC8B682BC0DC016D0E8055F3F4440'::geometry, 3857),'28','13',0.8),
+(30,'0101000020E6100000A2DC1964A8C5F7BF19299C994D004340'::geometry,ST_Transform('0101000020E6100000A2DC1964A8C5F7BF19299C994D004340'::geometry, 3857),'30','14',0.1),
+(31,'0101000020E6100000DCA1FCC87B56FABF9B88E9D866554540'::geometry,ST_Transform('0101000020E6100000DCA1FCC87B56FABF9B88E9D866554540'::geometry, 3857),'31','15',0.9),
+(32,'0101000020E6100000E1517AFCD15E1EC0A18D8D4825194540'::geometry,ST_Transform('0101000020E6100000E1517AFCD15E1EC0A18D8D4825194540'::geometry, 3857),'32','12',0.3),
+(33,'0101000020E6100000A7FF33825AF917C0FABE7DFB6BA54540'::geometry,ST_Transform('0101000020E6100000A7FF33825AF917C0FABE7DFB6BA54540'::geometry, 3857),'33','03',0.4),
+(34,'0101000020E6100000FB4E4EBEB72412C0898E7240982F4540'::geometry,ST_Transform('0101000020E6100000FB4E4EBEB72412C0898E7240982F4540'::geometry, 3857),'34','07',0.3),
+(35,'0101000020E6100000224682B01B1A2DC011091656CC5C3C40'::geometry,ST_Transform('0101000020E6100000224682B01B1A2DC011091656CC5C3C40'::geometry, 3857),'35','05',0.3),
+(36,'0101000020E6100000F7C9447110EC20C04C5D4823C7374540'::geometry,ST_Transform('0101000020E6100000F7C9447110EC20C04C5D4823C7374540'::geometry, 3857),'36','12',0.2),
+(37,'0101000020E610000053D6A26DFB4218C09D58FAE209674440'::geometry,ST_Transform('0101000020E610000053D6A26DFB4218C09D58FAE209674440'::geometry, 3857),'37','07',0.5),
+(38,'0101000020E6100000B1D1B5FC910431C03C0C89BA03503C40'::geometry,ST_Transform('0101000020E6100000B1D1B5FC910431C03C0C89BA03503C40'::geometry, 3857),'38','05',0.4),
+(39,'0101000020E610000086E6FEE1BD1E10C00417096748994540'::geometry,ST_Transform('0101000020E610000086E6FEE1BD1E10C00417096748994540'::geometry, 3857),'39','06',0.6),
+(40,'0101000020E6100000FB51C33F733710C038D01729E4954440'::geometry,ST_Transform('0101000020E6100000FB51C33F733710C038D01729E4954440'::geometry, 3857),'40','07',0.5),
+(41,'0101000020E6100000912D6FDA28BB16C031321F08C4B74240'::geometry,ST_Transform('0101000020E6100000912D6FDA28BB16C031321F08C4B74240'::geometry, 3857),'41','01',0.4),
+(42,'0101000020E6100000554432EABEB504C069ECD78775CF4440'::geometry,ST_Transform('0101000020E6100000554432EABEB504C069ECD78775CF4440'::geometry, 3857),'42','07',0.2),
+(43,'0101000020E6100000157F117C1A2EEA3F027CD1F2368B4440'::geometry,ST_Transform('0101000020E6100000157F117C1A2EEA3F027CD1F2368B4440'::geometry, 3857),'43','09',0.3),
+(44,'0101000020E610000051AA5B1BD718EABFEE67613BA4544440'::geometry,ST_Transform('0101000020E610000051AA5B1BD718EABFEE67613BA4544440'::geometry, 3857),'44','02',0.2),
+(45,'0101000020E610000022C5C01BB69710C08563BC1499E54340'::geometry,ST_Transform('0101000020E610000022C5C01BB69710C08563BC1499E54340'::geometry, 3857),'45','08',0.3),
+(46,'0101000020E6100000D5FCF78A11A0E9BFDEA46F8E64AF4340'::geometry,ST_Transform('0101000020E6100000D5FCF78A11A0E9BFDEA46F8E64AF4340'::geometry, 3857),'46','10',0.2),
+(47,'0101000020E61000003AE63525866313C02100050B2BD14440'::geometry,ST_Transform('0101000020E61000003AE63525866313C02100050B2BD14440'::geometry, 3857),'47','07',0.3),
+(48,'0101000020E610000030F187FD1FD206C0C767E1496C9E4540'::geometry,ST_Transform('0101000020E610000030F187FD1FD206C0C767E1496C9E4540'::geometry, 3857),'48','16',0.5),
+(49,'0101000020E61000009C22867B12EC17C006C5F40C14DD4440'::geometry,ST_Transform('0101000020E61000009C22867B12EC17C006C5F40C14DD4440'::geometry, 3857),'49','07',0.2),
+(50,'0101000020E6100000F7D5EFC62D08F1BF69D1231D68CF4440'::geometry,ST_Transform('0101000020E6100000F7D5EFC62D08F1BF69D1231D68CF4440'::geometry, 3857),'50','02',0.6),
+(51,'0101000020E61000005B0E1F8DAA5F15C0530BFE285BF24140'::geometry,ST_Transform('0101000020E61000005B0E1F8DAA5F15C0530BFE285BF24140'::geometry, 3857),'51','18',0.01),
+(10,'0101000020E61000000FD65D82AEA418C06192D1351FDB4340'::geometry,ST_Transform('0101000020E61000000FD65D82AEA418C06192D1351FDB4340'::geometry, 3857),'10','11',0.04),
+(11,'0101000020E6100000B305531DAB0A17C0DEAFCD4EE5464240'::geometry,ST_Transform('0101000020E6100000B305531DAB0A17C0DEAFCD4EE5464240'::geometry, 3857),'11','01',0.08),
+(12,'0101000020E610000059721A7297C9C2BF9EBE383BE51E4440'::geometry,ST_Transform('0101000020E610000059721A7297C9C2BF9EBE383BE51E4440'::geometry, 3857),'12','10',0.2),
+(14,'0101000020E610000000C86313AF3C13C0E530879C10FF4240'::geometry,ST_Transform('0101000020E610000000C86313AF3C13C0E530879C10FF4240'::geometry, 3857),'14','01',0.2),
+(15,'0101000020E61000002A475497B6ED20C06643D4131A904540'::geometry,ST_Transform('0101000020E61000002A475497B6ED20C06643D4131A904540'::geometry, 3857),'15','12',0.3),
+(20,'0101000020E6100000F975566FAD8D01C0E840C33F67924540'::geometry,ST_Transform('0101000020E6100000F975566FAD8D01C0E840C33F67924540'::geometry, 3857),'20','16',0.8),
+(23,'0101000020E610000025FA13E595880BC022BB07131D024340'::geometry,ST_Transform('0101000020E610000025FA13E595880BC022BB07131D024340'::geometry, 3857),'23','01',0.1),
+(24,'0101000020E61000009C5F91C5095C17C0C78784B15A4F4540'::geometry,ST_Transform('0101000020E61000009C5F91C5095C17C0C78784B15A4F4540'::geometry, 3857),'24','07',0.3),
+(29,'0101000020E6100000C34D4A5B48E712C092E680892C684240'::geometry,ST_Transform('0101000020E6100000C34D4A5B48E712C092E680892C684240'::geometry, 3857),'29','01',0.3),
+(52,'0101000020E6100000406A545EB29A07C04E5F0BDA39A54140'::geometry,ST_Transform('0101000020E6100000406A545EB29A07C04E5F0BDA39A54140'::geometry, 3857),'52','19',0.01)
+\i test/fixtures/ppoints2.sql
+-- test table (spanish province centroids with some invented values)
+CREATE TABLE ppoints2 (cartodb_id integer, the_geom geometry, code text, region_code text, numerator float, denominator float);
+INSERT INTO ppoints2 VALUES
+( 1,'0101000020E6100000A8306DC0CBC305C051D14B6CE56A4540'::geometry,'01','16',0.5, 1.0),
+( 4,'0101000020E6100000E220A4362DC202C0FD8AFA5119994240'::geometry,'04','01',0.1, 1.0),
+( 5,'0101000020E610000004377E573AC813C0CB5871BB17494440'::geometry,'05','07',0.3, 1.0),
+( 2,'0101000020E610000000F49BE19BAFFFBF639958FDA6694340'::geometry,'02','08',0.7, 1.0),
+( 3,'0101000020E61000005D0B7E63C832E2BFDB63EB00443D4340'::geometry,'03','10',0.2, 1.0),
+( 6,'0101000020E61000006F3742B7FB9018C0DD967DC4D95A4340'::geometry,'06','11',0.05, 1.0),
+( 7,'0101000020E6100000E4BB36995F4C0740EAC0E5CA9FC94340'::geometry,'07','04',0.4, 1.0),
+( 8,'0101000020E61000003D43CC6CAFBEFF3F6B52E66F91DD4440'::geometry,'08','09',0.7, 1.0),
+( 9,'0101000020E61000003CC797BD99AF0CC0495A87FA312F4540'::geometry,'09','07',0.5, 1.0),
+(13,'0101000020E61000001CAA00A9F19F0EC05DF9267B7A764340'::geometry,'13','08',0.4, 1.0),
+(16,'0101000020E6100000D8208F3CBC9001C065638DC1B1F24340'::geometry,'16','08',0.4, 1.0),
+(17,'0101000020E6100000E9E6A94A71630540AD7A0CB062104540'::geometry,'17','09',0.6, 1.0),
+(18,'0101000020E6100000719792D59E240AC098AC548E00A84240'::geometry,'18','01',0.3, 1.0),
+(19,'0101000020E6100000972C878B50FD04C0123C881D1F684440'::geometry,'19','08',0.7, 1.0),
+(21,'0101000020E6100000F7893E9934511BC0EAA4BF03E1C94240'::geometry,'21','01',0.1, 1.0),
+(22,'0101000020E6100000572C2123B2A8B2BF7ED7FABAFD194540'::geometry,'22','02',0.4, 1.0),
+(25,'0101000020E6100000461B67D688C4F03FD990EEC3A0054540'::geometry,'25','09',0.4, 1.0),
+(26,'0101000020E6100000A139FB06E82204C0539D84F62E234540'::geometry,'26','17',0.6, 1.0),
+(27,'0101000020E6100000A92E54E618C91DC00D3A947B81814540'::geometry,'27','12',0.3, 1.0),
+(28,'0101000020E6100000971DC8B682BC0DC016D0E8055F3F4440'::geometry,'28','13',0.8, 1.0),
+(30,'0101000020E6100000A2DC1964A8C5F7BF19299C994D004340'::geometry,'30','14',0.1, 1.0),
+(31,'0101000020E6100000DCA1FCC87B56FABF9B88E9D866554540'::geometry,'31','15',0.9, 1.0),
+(32,'0101000020E6100000E1517AFCD15E1EC0A18D8D4825194540'::geometry,'32','12',0.3, 1.0),
+(33,'0101000020E6100000A7FF33825AF917C0FABE7DFB6BA54540'::geometry,'33','03',0.4, 1.0),
+(34,'0101000020E6100000FB4E4EBEB72412C0898E7240982F4540'::geometry,'34','07',0.3, 1.0),
+(35,'0101000020E6100000224682B01B1A2DC011091656CC5C3C40'::geometry,'35','05',0.3, 1.0),
+(36,'0101000020E6100000F7C9447110EC20C04C5D4823C7374540'::geometry,'36','12',0.2, 1.0),
+(37,'0101000020E610000053D6A26DFB4218C09D58FAE209674440'::geometry,'37','07',0.5, 1.0),
+(38,'0101000020E6100000B1D1B5FC910431C03C0C89BA03503C40'::geometry,'38','05',0.4, 1.0),
+(39,'0101000020E610000086E6FEE1BD1E10C00417096748994540'::geometry,'39','06',0.6, 1.0),
+(40,'0101000020E6100000FB51C33F733710C038D01729E4954440'::geometry,'40','07',0.5, 1.0),
+(41,'0101000020E6100000912D6FDA28BB16C031321F08C4B74240'::geometry,'41','01',0.4, 1.0),
+(42,'0101000020E6100000554432EABEB504C069ECD78775CF4440'::geometry,'42','07',0.2, 1.0),
+(43,'0101000020E6100000157F117C1A2EEA3F027CD1F2368B4440'::geometry,'43','09',0.3, 1.0),
+(44,'0101000020E610000051AA5B1BD718EABFEE67613BA4544440'::geometry,'44','02',0.2, 1.0),
+(45,'0101000020E610000022C5C01BB69710C08563BC1499E54340'::geometry,'45','08',0.3, 1.0),
+(46,'0101000020E6100000D5FCF78A11A0E9BFDEA46F8E64AF4340'::geometry,'46','10',0.2, 1.0),
+(47,'0101000020E61000003AE63525866313C02100050B2BD14440'::geometry,'47','07',0.3, 1.0),
+(48,'0101000020E610000030F187FD1FD206C0C767E1496C9E4540'::geometry,'48','16',0.5, 1.0),
+(49,'0101000020E61000009C22867B12EC17C006C5F40C14DD4440'::geometry,'49','07',0.2, 1.0),
+(50,'0101000020E6100000F7D5EFC62D08F1BF69D1231D68CF4440'::geometry,'50','02',0.6, 1.0),
+(51,'0101000020E61000005B0E1F8DAA5F15C0530BFE285BF24140'::geometry,'51','18',0.01, 1.0),
+(10,'0101000020E61000000FD65D82AEA418C06192D1351FDB4340'::geometry,'10','11',0.04, 1.0),
+(11,'0101000020E6100000B305531DAB0A17C0DEAFCD4EE5464240'::geometry,'11','01',0.08, 1.0),
+(12,'0101000020E610000059721A7297C9C2BF9EBE383BE51E4440'::geometry,'12','10',0.2, 1.0),
+(14,'0101000020E610000000C86313AF3C13C0E530879C10FF4240'::geometry,'14','01',0.2, 1.0),
+(15,'0101000020E61000002A475497B6ED20C06643D4131A904540'::geometry,'15','12',0.3, 1.0),
+(20,'0101000020E6100000F975566FAD8D01C0E840C33F67924540'::geometry,'20','16',0.8, 1.0),
+(23,'0101000020E610000025FA13E595880BC022BB07131D024340'::geometry,'23','01',0.1, 1.0),
+(24,'0101000020E61000009C5F91C5095C17C0C78784B15A4F4540'::geometry,'24','07',0.3, 1.0),
+(29,'0101000020E6100000C34D4A5B48E712C092E680892C684240'::geometry,'29','01',0.3, 1.0),
+(52,'0101000020E6100000406A545EB29A07C04E5F0BDA39A54140'::geometry,'52','19',0.0, 1.01)
+-- Areas of Interest functions perform some nondeterministic computations
+-- (to estimate the significance); we will set the seeds for the RNGs
+-- that affect those results to have repeateble results
+SELECT cdb_crankshaft._cdb_random_seeds(1234);
+ _cdb_random_seeds 
+-------------------
+ 
 (1 row)
-code|quads
-01|HH
-02|HL
-03|LL
-04|LL
-05|LH
-06|LL
-07|HH
-08|HH
-09|HH
-10|LL
-11|LL
-12|LL
-13|HL
-14|LL
-15|LL
-16|HH
-17|HH
-18|LL
-19|HH
-20|HH
-21|LL
-22|HH
-23|LL
-24|LL
-25|HH
-26|HH
-27|LL
-28|HH
-29|LL
-30|LL
-31|HH
-32|LL
-33|HL
-34|LH
-35|LL
-36|LL
-37|HL
-38|HL
-39|HH
-40|HH
-41|HL
-42|LH
-43|LH
-44|LL
-45|LH
-46|LL
-47|LL
-48|HH
-49|LH
-50|HH
-51|LL
-52|LL
+
+SELECT ppoints.code, m.quads
+  FROM ppoints
+  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local('SELECT * FROM ppoints', 'value') m
+    ON ppoints.cartodb_id = m.ids
+  ORDER BY ppoints.code;
+ code | quads 
+------+-------
+ 01   | HH
+ 02   | HL
+ 03   | LL
+ 04   | LL
+ 05   | LH
+ 06   | LL
+ 07   | HH
+ 08   | HH
+ 09   | HH
+ 10   | LL
+ 11   | LL
+ 12   | LL
+ 13   | HL
+ 14   | LL
+ 15   | LL
+ 16   | HH
+ 17   | HH
+ 18   | LL
+ 19   | HH
+ 20   | HH
+ 21   | LL
+ 22   | HH
+ 23   | LL
+ 24   | LL
+ 25   | HH
+ 26   | HH
+ 27   | LL
+ 28   | HH
+ 29   | LL
+ 30   | LL
+ 31   | HH
+ 32   | LL
+ 33   | HL
+ 34   | LH
+ 35   | LL
+ 36   | LL
+ 37   | HL
+ 38   | HL
+ 39   | HH
+ 40   | HH
+ 41   | HL
+ 42   | LH
+ 43   | LH
+ 44   | LL
+ 45   | LH
+ 46   | LL
+ 47   | LL
+ 48   | HH
+ 49   | LH
+ 50   | HH
+ 51   | LL
+ 52   | LL
 (52 rows)
-_cdb_random_seeds

+SELECT cdb_crankshaft._cdb_random_seeds(1234);
+ _cdb_random_seeds 
+-------------------
+ 
 (1 row)
-code|quads
-01|HH
-02|HL
-07|HH
-08|HH
-09|HH
-13|HL
-16|HH
-17|HH
-19|HH
-20|HH
-22|HH
-25|HH
-26|HH
-28|HH
-31|HH
-33|HL
-37|HL
-38|HL
-39|HH
-40|HH
-41|HL
-48|HH
-50|HH
-(23 rows)
-_cdb_random_seeds

-(1 row)
-code|quads
-03|LL
-04|LL
-05|LH
-06|LL
-10|LL
-11|LL
-12|LL
-14|LL
-15|LL
-18|LL
-21|LL
-23|LL
-24|LL
-27|LL
-29|LL
-30|LL
-32|LL
-34|LH
-35|LL
-36|LL
-42|LH
-43|LH
-44|LL
-45|LH
-46|LL
-47|LL
-49|LH
-51|LL
-52|LL
-(29 rows)
-_cdb_random_seeds
-
-(1 row)
-code|quads
-02|HL
-05|LH
-13|HL
-33|HL
-34|LH
-37|HL
-38|HL
-41|HL
-42|LH
-43|LH
-45|LH
-49|LH
-(12 rows)
-_cdb_random_seeds
-
-(1 row)
-code|quads
-01|LL
-02|LH
-03|HH
-04|HH
-05|LL
-06|HH
-07|LL
-08|LL
-09|LL
-10|HH
-11|HH
-12|HL
-13|LL
-14|HH
-15|LL
-16|LL
-17|LL
-18|LH
-19|LL
-20|LL
-21|HH
-22|LL
-23|HL
-24|LL
-25|LL
-26|LL
-27|LL
-28|LL
-29|LH
-30|HH
-31|LL
-32|LL
-33|LL
-34|LL
-35|LH
-36|HL
-37|LH
-38|LH
-39|LL
-40|LL
-41|LH
-42|HL
-43|LL
-44|HL
-45|LL
-46|HL
-47|LL
-48|LL
-49|HL
-50|LL
-51|HH
+SELECT ppoints2.code, m.quads
+  FROM ppoints2
+  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local_Rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
+    ON ppoints2.cartodb_id = m.ids
+  ORDER BY ppoints2.code;
+ code | quads 
+------+-------
+ 01   | LL
+ 02   | LH
+ 03   | HH
+ 04   | HH
+ 05   | LL
+ 06   | HH
+ 07   | LL
+ 08   | LL
+ 09   | LL
+ 10   | HH
+ 11   | HH
+ 12   | HL
+ 13   | LL
+ 14   | HH
+ 15   | LL
+ 16   | LL
+ 17   | LL
+ 18   | LH
+ 19   | LL
+ 20   | LL
+ 21   | HH
+ 22   | LL
+ 23   | HL
+ 24   | LL
+ 25   | LL
+ 26   | LL
+ 27   | LL
+ 28   | LL
+ 29   | LH
+ 30   | HH
+ 31   | LL
+ 32   | LL
+ 33   | LL
+ 34   | LL
+ 35   | LH
+ 36   | HL
+ 37   | LH
+ 38   | LH
+ 39   | LL
+ 40   | LL
+ 41   | LH
+ 42   | HL
+ 43   | LL
+ 44   | HL
+ 45   | LL
+ 46   | HL
+ 47   | LL
+ 48   | LL
+ 49   | HL
+ 50   | LL
+ 51   | HH
 (51 rows)
-_cdb_random_seeds

-(1 row)
-code|quads
-03|HH
-04|HH
-06|HH
-10|HH
-11|HH
-12|HL
-14|HH
-21|HH
-23|HL
-30|HH
-36|HL
-42|HL
-44|HL
-46|HL
-49|HL
-51|HH
-(16 rows)
-_cdb_random_seeds
-
-(1 row)
-code|quads
-01|LL
-02|LH
-05|LL
-07|LL
-08|LL
-09|LL
-13|LL
-15|LL
-16|LL
-17|LL
-18|LH
-19|LL
-20|LL
-22|LL
-24|LL
-25|LL
-26|LL
-27|LL
-28|LL
-29|LH
-31|LL
-32|LL
-33|LL
-34|LL
-35|LH
-37|LH
-38|LH
-39|LL
-40|LL
-41|LH
-43|LL
-45|LL
-47|LL
-48|LL
-50|LL
-(35 rows)
-_cdb_random_seeds
-
-(1 row)
-code|quads
-02|LH
-12|HL
-18|LH
-23|HL
-29|LH
-35|LH
-36|HL
-37|LH
-38|LH
-41|LH
-42|HL
-44|HL
-46|HL
-49|HL
-(14 rows)
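
Review note: with the `CDB_GetSpatialHotspots`, `CDB_GetSpatialColdspots`, and `CDB_GetSpatialOutliers` wrappers removed, callers that relied on them have to filter the `quads` column themselves. The sketch below reproduces the three filters from the deleted SQL (`HH`/`HL`, `LL`/`LH`, `HL`/`LH`) in Python. The helper name `filter_quads` is an assumption for illustration, and the sample rows are the first five rows of the expected `CDB_AreasOfInterest_Local` output.

```python
# Quadrant sets taken from the WHERE clauses of the removed SQL wrappers.
QUAD_FILTERS = {
    'hotspots': {'HH', 'HL'},
    'coldspots': {'LL', 'LH'},
    'outliers': {'HL', 'LH'},
}


def filter_quads(rows, kind):
    """Keep (code, quad) rows whose quadrant belongs to the requested kind."""
    wanted = QUAD_FILTERS[kind]
    return [(code, quad) for code, quad in rows if quad in wanted]


rows = [('01', 'HH'), ('02', 'HL'), ('03', 'LL'), ('04', 'LL'), ('05', 'LH')]

hotspots = filter_quads(rows, 'hotspots')
coldspots = filter_quads(rows, 'coldspots')
outliers = filter_quads(rows, 'outliers')

print(hotspots)
print(coldspots)
print(outliers)
```

The same effect in SQL is a `WHERE quads IN (...)` clause on the result of the remaining function.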
--- a/src/pg/test/expected/03_overlap_sum_test.out
+++ b/src/pg/test/expected/03_overlap_sum_test.out
@@ -1,11 +1,21 @@
 \i test/fixtures/polyg_values.sql
-SET client_min_messages TO WARNING;
-\set ECHO none
+CREATE TABLE values (cartodb_id integer, value float, the_geom geometry);
+INSERT INTO values(cartodb_id, value, the_geom) VALUES
+(1,10,'0106000020E61000000100000001030000000100000005000000E5AF3500C03608C08068629111374440C7BC0A00C00F02C0AC0551523B414440C7BC0A00C0A700C0CAF23B6E74FB4340A7267FFFFF5206C0FBB7E41B7EE74340E5AF3500C03608C08068629111374440'::geometry),
+(2,20,'0106000020E610000001000000010300000001000000050000002439EC00804AF7BF07D6CCB5C3064440C7BC0A00C0A700C0CAF23B6E74FB4340C7BC0A00C00F02C0AC0551523B414440E20CD5FFFF30FABFBE4F76AFEA4B44402439EC00804AF7BF07D6CCB5C3064440'::geometry)
+SELECT round(cdb_crankshaft.cdb_overlap_sum(
+  '0106000020E61000000100000001030000000100000004000000FFFFFFFFFF3604C09A0B9ECEC42E444000000000C060FBBF30C7FD70E01D44400000000040AD02C06481F1C8CD034440FFFFFFFFFF3604C09A0B9ECEC42E4440'::geometry,
+  'values', 'value'
+), 2);
 round 
 -------
  4.42
 (1 row)

+SELECT round(cdb_crankshaft.cdb_overlap_sum(
+  '0106000020E61000000100000001030000000100000004000000FFFFFFFFFF3604C09A0B9ECEC42E444000000000C060FBBF30C7FD70E01D44400000000040AD02C06481F1C8CD034440FFFFFFFFFF3604C09A0B9ECEC42E4440'::geometry,
+  'values', 'value', schema_name := 'public'
+), 2);
 round 
 -------
  4.42
--- a/src/pg/test/expected/07_gravity_test.out
+++ b/src/pg/test/expected/07_gravity_test.out
@@ -1,11 +0,0 @@
-                  the_geom                  |            h            |           hpop           |      dist
--------------------------------------------+-------------------------+--------------------------+----------------
- 01010000001361C3D32B650140DD24068195B34440 |  1.51078258369747945249 |  12.08626066957983561994 | 4964.714459152
- 01010000002497FF907EFB0040713D0AD7A3B04440 | 98.29730954183620807430 | 688.08116679285345652007 |   99.955141922
- 0101000000A167B3EA733501401D5A643BDFAF4440 | 63.70532894711274639196 | 382.23197368267647835174 | 2488.330566505
- 010100000062A1D634EF380140BE9F1A2FDDB44440 | 35.35415870080995954879 | 176.77079350404979774397 | 4359.370460594
- 010100000052B81E85EB510140355EBA490CB24440 | 33.12290506987740864904 | 132.49162027950963459615 | 3703.664449828
- 0101000000C286A757CA320140736891ED7CAF4440 | 65.45251754279248087849 | 196.35755262837744263547 | 2512.092358644
- 01010000007DD0B359F5390140C976BE9F1AAF4440 | 62.83927792471345639225 | 125.67855584942691278449 |  2926.25725244
- 0101000000D237691A140D01407E6FD39FFDB44440 | 53.54905726651871279586 |  53.54905726651871279586 | 3744.515577777
-(8 rows)
--- a/src/pg/test/fixtures/polyg_values.sql
+++ b/src/pg/test/fixtures/polyg_values.sql
@@ -1,5 +1,3 @@
-SET client_min_messages TO WARNING;
-\set ECHO none
 CREATE TABLE values (cartodb_id integer, value float, the_geom geometry);
 INSERT INTO values(cartodb_id, value, the_geom) VALUES
 (1,10,'0106000020E61000000100000001030000000100000005000000E5AF3500C03608C08068629111374440C7BC0A00C00F02C0AC0551523B414440C7BC0A00C0A700C0CAF23B6E74FB4340A7267FFFFF5206C0FBB7E41B7EE74340E5AF3500C03608C08068629111374440'::geometry),
--- a/src/pg/test/fixtures/ppoints.sql
+++ b/src/pg/test/fixtures/ppoints.sql
@@ -1,5 +1,3 @@
-SET client_min_messages TO WARNING;
-\set ECHO none
 -- test table (spanish province centroids with some invented values)
 CREATE TABLE ppoints (cartodb_id integer, the_geom geometry, the_geom_webmercator geometry, code text, region_code text, value float);
 INSERT INTO ppoints VALUES
--- a/src/pg/test/fixtures/ppoints2.sql
+++ b/src/pg/test/fixtures/ppoints2.sql
@@ -1,5 +1,3 @@
-SET client_min_messages TO WARNING;
-\set ECHO none
 -- test table (spanish province centroids with some invented values)
 CREATE TABLE ppoints2 (cartodb_id integer, the_geom geometry, code text, region_code text, numerator float, denominator float);
 INSERT INTO ppoints2 VALUES
--- a/src/pg/test/sql/02_moran_test.sql
+++ b/src/pg/test/sql/02_moran_test.sql
@@ -1,5 +1,3 @@
-\pset format unaligned
-\set ECHO all
 \i test/fixtures/ppoints.sql
 \i test/fixtures/ppoints2.sql

@@ -10,70 +8,14 @@ SELECT cdb_crankshaft._cdb_random_seeds(1234);

 SELECT ppoints.code, m.quads
  FROM ppoints
-  JOIN cdb_crankshaft.CDB_AreasOfInterestLocal('SELECT * FROM ppoints', 'value') m
-    ON ppoints.cartodb_id = m.rowid
+  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local('SELECT * FROM ppoints', 'value') m
+    ON ppoints.cartodb_id = m.ids
  ORDER BY ppoints.code;

 SELECT cdb_crankshaft._cdb_random_seeds(1234);

-- Spatial Hotspots
-SELECT ppoints.code, m.quads
-  FROM ppoints
-  JOIN cdb_crankshaft.CDB_GetSpatialHotspots('SELECT * FROM ppoints', 'value') m
-    ON ppoints.cartodb_id = m.rowid
-  ORDER BY ppoints.code;
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-- Spatial Coldspots
-SELECT ppoints.code, m.quads
-  FROM ppoints
-  JOIN cdb_crankshaft.CDB_GetSpatialColdspots('SELECT * FROM ppoints', 'value') m
-    ON ppoints.cartodb_id = m.rowid
-  ORDER BY ppoints.code;
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-  -- Spatial Outliers
-SELECT ppoints.code, m.quads
-  FROM ppoints
-  JOIN cdb_crankshaft.CDB_GetSpatialOutliers('SELECT * FROM ppoints', 'value') m
-    ON ppoints.cartodb_id = m.rowid
-  ORDER BY ppoints.code;
-
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-- Areas of Interest (rate)
 SELECT ppoints2.code, m.quads
  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_AreasOfInterestLocalRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
-    ON ppoints2.cartodb_id = m.rowid
-  ORDER BY ppoints2.code;
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-- Spatial Hotspots (rate)
-SELECT ppoints2.code, m.quads
-  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_GetSpatialHotspotsRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
-    ON ppoints2.cartodb_id = m.rowid
-  ORDER BY ppoints2.code;
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-- Spatial Coldspots (rate)
-SELECT ppoints2.code, m.quads
-  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_GetSpatialColdspotsRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
-    ON ppoints2.cartodb_id = m.rowid
-  ORDER BY ppoints2.code;
-
-SELECT cdb_crankshaft._cdb_random_seeds(1234);
-
-- Spatial Outliers (rate)
-SELECT ppoints2.code, m.quads
-  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_GetSpatialOutliersRate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
-    ON ppoints2.cartodb_id = m.rowid
+  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local_Rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
+    ON ppoints2.cartodb_id = m.ids
  ORDER BY ppoints2.code;
--- a/src/pg/test/sql/07_gravity_test.sql
+++ b/src/pg/test/sql/07_gravity_test.sql
@@ -1,21 +0,0 @@
-WITH t AS (
-    SELECT
-    ARRAY[1,2,3] AS id,
-    ARRAY[7.0,8.0,3.0] AS w,
-    ARRAY[ST_GeomFromText('POINT(2.1744 41.4036)'),ST_GeomFromText('POINT(2.1228 41.3809)'),ST_GeomFromText('POINT(2.1511 41.3742)')] AS g
-),
-s AS (
-    SELECT
-    ARRAY[10,20,30,40,50,60,70,80] AS id,
-    ARRAY[800, 700, 600, 500, 400, 300, 200, 100] AS p,
-    ARRAY[ST_GeomFromText('POINT(2.1744 41.403)'),ST_GeomFromText('POINT(2.1228 41.380)'),ST_GeomFromText('POINT(2.1511 41.374)'),ST_GeomFromText('POINT(2.1528 41.413)'),ST_GeomFromText('POINT(2.165 41.391)'),ST_GeomFromText('POINT(2.1498 41.371)'),ST_GeomFromText('POINT(2.1533 41.368)'),ST_GeomFromText('POINT(2.131386 41.41399)')] AS g
-)
-SELECT
-    g.the_geom,
-    g.h,
-    g.hpop,
-    g.dist
-FROM
-    t,
-    s,
-    CDB_Gravity(t.id, t.g, t.w, s.id, s.g, s.p, 2, 100000, 3) g;
--- a/src/py/crankshaft/crankshaft/__init__.py
+++ b/src/py/crankshaft/crankshaft/__init__.py
@@ -1,2 +1,3 @@
 import random_seeds
 import clustering
+import similarity
--- a/src/py/crankshaft/crankshaft/clustering/moran.py
+++ b/src/py/crankshaft/crankshaft/clustering/moran.py
@@ -14,7 +14,7 @@ import crankshaft.pysal_utils as pu
 # High level interface ---------------------------------------

 def moran(subquery, attr_name,
-          w_type, num_ngbrs, permutations, geom_col, id_col):
+          permutations, geom_col, id_col, w_type, num_ngbrs):
    """
    Moran's I (global)
    Implementation building neighbors with a PostGIS database and Moran's I
@@ -56,7 +56,7 @@ def moran(subquery, attr_name,
    return zip([moran_global.I], [moran_global.EI])

 def moran_local(subquery, attr,
-                w_type, num_ngbrs, permutations, geom_col, id_col):
+                permutations, geom_col, id_col, w_type, num_ngbrs):
    """
    Moran's I implementation for PL/Python
    Andy Eschbacher
@@ -96,7 +96,7 @@ def moran_local(subquery, attr,
    return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)

 def moran_rate(subquery, numerator, denominator,
-               w_type, num_ngbrs, permutations, geom_col, id_col):
+               permutations, geom_col, id_col, w_type, num_ngbrs):
    """
    Moran's I Rate (global)
    Andy Eschbacher
@@ -137,7 +137,7 @@ def moran_rate(subquery, numerator, denominator,
    return zip([lisa_rate.I], [lisa_rate.EI])

 def moran_local_rate(subquery, numerator, denominator,
-                     w_type, num_ngbrs, permutations, geom_col, id_col):
+                     permutations, geom_col, id_col, w_type, num_ngbrs):
    """
        Moran's I Local Rate
        Andy Eschbacher
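The four hunks above move `permutations, geom_col, id_col` ahead of `w_type, num_ngbrs`, so any caller that still passes arguments positionally in the old order will silently bind them to the wrong parameters. A minimal sketch of the failure mode and of a keyword-based call that survives the reordering follows; `moran_local_stub` is a hypothetical stand-in that only echoes its arguments, not the real `moran_local`.

```python
def moran_local_stub(subquery, attr,
                     permutations, geom_col, id_col, w_type, num_ngbrs):
    """Hypothetical stand-in that mirrors the new parameter order and echoes its arguments."""
    return {
        'subquery': subquery, 'attr': attr, 'permutations': permutations,
        'geom_col': geom_col, 'id_col': id_col,
        'w_type': w_type, 'num_ngbrs': num_ngbrs,
    }

# Old-style positional call: 'knn' is bound to `permutations`, 5 to `geom_col`, and so on.
stale = moran_local_stub('subquery', 'value', 'knn', 5, 99, 'the_geom', 'cartodb_id')

# Keyword call: the binding does not depend on parameter order.
safe = moran_local_stub('subquery', 'value',
                        w_type='knn', num_ngbrs=5, permutations=99,
                        geom_col='the_geom', id_col='cartodb_id')
```

Passing the trailing arguments by keyword at the PL/Python call sites would be one way to keep such a reordering from changing behavior unnoticed.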
--- a/src/py/crankshaft/crankshaft/pysal_utils/pysal_utils.py
+++ b/src/py/crankshaft/crankshaft/pysal_utils/pysal_utils.py
@@ -11,7 +11,7 @@ def construct_neighbor_query(w_type, query_vals):
        @param query_vals dict: values used to construct the query
    """

-    if w_type.lower() == 'knn':
+    if w_type == 'knn':
        return knn(query_vals)
    else:
        return queen(query_vals)
@@ -22,7 +22,7 @@ def get_weight(query_res, w_type='knn', num_ngbrs=5):
        Construct PySAL weight from return value of query
        @param query_res: query results with attributes and neighbors
    """
-    if w_type.lower() == 'knn':
+    if w_type == 'knn':
        row_normed_weights = [1.0 / float(num_ngbrs)] * num_ngbrs
        weights = {x['id']: row_normed_weights for x in query_res}
    else:
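With `.lower()` removed, a value such as `'KNN'` no longer matches `'knn'` and falls through to the `else` branch in both functions. If case-insensitive input is still wanted, one option is to normalize the value once before either function is called. The sketch below assumes that `'knn'` and `'queen'` are the only intended weight types, which the diff suggests but does not state; `normalize_w_type` and `VALID_W_TYPES` are hypothetical names.

```python
VALID_W_TYPES = ('knn', 'queen')  # assumption: the two branches visible in the diff


def normalize_w_type(w_type):
    """Hypothetical helper: lower-case once and reject unknown weight types."""
    cleaned = w_type.strip().lower()
    if cleaned not in VALID_W_TYPES:
        raise ValueError('unknown weight type: {0!r}'.format(w_type))
    return cleaned
```

Calling this once at the entry point keeps the per-call comparisons as plain `==` checks while still accepting mixed-case input.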
--- a/src/py/crankshaft/crankshaft/similarity/__init__.py
+++ b/src/py/crankshaft/crankshaft/similarity/__init__.py
@@ -0,0 +1 @@
+from similarity import * 
--- a/src/py/crankshaft/crankshaft/similarity/similarity.py
+++ b/src/py/crankshaft/crankshaft/similarity/similarity.py
@@ -0,0 +1,91 @@
+from sklearn.neighbors import NearestNeighbors
+import  scipy.stats as stats
+import numpy as np
+import plpy
+import time
+import cPickle
+
+
+def query_to_dictionary(result):
+    return [ dict(zip(r.keys(), r.values())) for r in result ]
+
+def drop_all_nan_columns(data):
+    return data[ :, ~np.isnan(data).all(axis=0)]
+    
+def fill_missing_na(data,val=None):
+    inds = np.where(np.isnan(data))
+    if val is None:
+        col_mean = stats.nanmean(data,axis=0)
+        data[inds]=np.take(col_mean,inds[1])
+    else:
+        data[inds]=np.take(val, inds[1])
+    return data
+    
+def similarity_rank(target_cartodb_id, query):
+    start_time  = time.time() 
+    #plpy.notice('converting to dictionary ', start_time) 
+    #data = query_to_dictionary(plpy.execute(query))  
+    plpy.notice('converted , running query ', time.time() - start_time)
+    
+    data = plpy.execute(query_only_values(query))
+    plpy.notice('run query  , getting cartodb_ids', time.time() - start_time)
+    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
+    target_id  = cartodb_ids.index(target_cartodb_id)
+    plpy.notice('run query  , extracting ', time.time() - start_time)
+    features, target = extract_features_target(data,target_id)
+    plpy.notice('extracted  , cleaning ', time.time() - start_time)
+    features = fill_missing_na(drop_all_nan_columns(features))
+    plpy.notice('cleaned , normalizing', time.time() - start_time)
+    # Use the cleaned target row so its columns match the cleaned features.
+    normed_features, normed_target  = normalize_features(features, features[target_id])
+    plpy.notice('normalized , training ', time.time() - start_time )
+    tree = train(normed_features)
+    plpy.notice('normalized , pickling ', time.time() - start_time )
+    #plpy.notice('tree_dump ',  len(cPickle.dumps(tree, protocol=cPickle.HIGHEST_PROTOCOL)))
+    plpy.notice('pickles, querying ', time.time() - start_time)
+    dist, ind  = tree.kneighbors(normed_target.reshape(1, -1))
+    plpy.notice('queried , rectifying', time.time() - start_time)
+    return zip([cartodb_ids[i] for i in ind[0]], dist[0])  # kneighbors sorts by distance; reorder ids to match
+
+def query_cartodb_id(query):
+    return 'select array_agg(cartodb_id) a from ({0}) b'.format(query)
+
+def query_only_values(query):
+    first_row = plpy.execute('select * from ({query}) a limit 1'.format(query=query))
+    just_values = ','.join([ key for key in  first_row[0].keys()  if key not in ['the_geom', 'the_geom_webmercator','cartodb_id']])
+    return 'select Array[{0}] a from ({1}) b '.format(just_values, query)
+
+
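+def most_similar(matches,query):
+    data = plpy.execute(query_only_values(query))
+    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
+    features = np.array([row['a'] for row in data], dtype=float)
+    features = fill_missing_na(drop_all_nan_columns(features))
+    normed_features, _ = normalize_features(features, features)
+    tree = train(normed_features)  # the original referenced an undefined `tree`
+    # For each row, collect the ids of its `matches` nearest neighbours.
+    dist, ind = tree.kneighbors(normed_features, n_neighbors=matches)
+    return cartodb_ids, [[cartodb_ids[i] for i in row] for row in ind]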
+    
+    
+def train(features):
+    tree = NearestNeighbors( n_neighbors=len(features), algorithm='auto').fit(features)
+    return tree
+    
+def normalize_features(features, target):
+    maxes = features.max(axis=0)
+    mins  = features.min(axis=0)
+    return (features - mins)/(maxes-mins), (target-mins)/(maxes-mins)
+ 
+def extract_row(row):
+    keys = row.keys()
+    values = row.values()
+    del values[ keys.index('cartodb_id')]
+    return values
+
+def extract_features_target(data, target_index=None):
+    features = [row['a'] for row in data]
+    # Without a target index, return no target instead of indexing the list with None.
+    target   = None if target_index is None else np.array(features[target_index], dtype=float)
+    return np.array(features, dtype=float), target
+    
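The ranking pipeline in `similarity.py` (drop all-NaN columns, fill remaining NaNs with column means, min-max scale, then rank rows by distance to a target row) can be sketched without a database or scikit-learn. The version below is an illustration under stated assumptions, not the code from this commit: it swaps `NearestNeighbors` for a brute-force Euclidean distance in NumPy, guards against constant columns (where `maxes - mins` is zero), and returns ids in the same order as their sorted distances. `rank_by_similarity` is a hypothetical name.

```python
import numpy as np


def rank_by_similarity(ids, features, target_index):
    """Return (id, distance) pairs sorted from most to least similar to the target row."""
    features = np.array(features, dtype=float)  # copy, so the caller's data is untouched
    # Drop columns that are entirely NaN, then fill remaining NaNs with the column mean.
    features = features[:, ~np.isnan(features).all(axis=0)]
    col_mean = np.nanmean(features, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(features))
    features[nan_rows, nan_cols] = col_mean[nan_cols]
    # Min-max scale; constant columns get a span of 1 to avoid dividing by zero.
    mins = features.min(axis=0)
    span = features.max(axis=0) - mins
    span[span == 0] = 1.0
    normed = (features - mins) / span
    # Brute-force Euclidean distance from every row to the target row.
    dist = np.sqrt(((normed - normed[target_index]) ** 2).sum(axis=1))
    order = np.argsort(dist, kind='mergesort')  # stable sort keeps ties in input order
    return [(ids[i], float(dist[i])) for i in order]


ranked = rank_by_similarity(
    [10, 20, 30],
    [[0.0, 5.0], [10.0, 5.0], [1.0, 5.0]],
    target_index=0,
)
```

Here the second column is constant, so only the first column contributes to the distances, and the target row itself comes first with distance zero.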
--- a/src/py/crankshaft/setup.py
+++ b/src/py/crankshaft/setup.py
@@ -40,9 +40,9 @@ setup(

    # The choice of component versions is dictated by what's
    # provisioned in the production servers.
-    install_requires=['pysal==1.9.1'],
+    install_requires=['pysal==1.9.1', 'scikit-learn==0.17.1'],

-    requires=['pysal', 'numpy' ],
+    requires=['pysal', 'numpy','sklearn'],

    test_suite='test'
 )
--- a/src/py/crankshaft/test/test_clustering_moran.py
+++ b/src/py/crankshaft/test/test_clustering_moran.py
@@ -52,7 +52,7 @@ class MoranTest(unittest.TestCase):
        data = [ { 'id': d['id'], 'attr1': d['value'], 'neighbors': d['neighbors'] } for d in self.neighbors_data]
        plpy._define_result('select', data)
        random_seeds.set_random_seeds(1234)
-        result = cc.moran_local('subquery', 'value', 'knn', 5, 99, 'the_geom', 'cartodb_id')
+        result = cc.moran_local('subquery', 'value', 99, 'the_geom', 'cartodb_id', 'knn', 5)
        result = [(row[0], row[1]) for row in result]
        expected = self.moran_data
        for ([res_val, res_quad], [exp_val, exp_quad]) in zip(result, expected):
@@ -64,7 +64,7 @@ class MoranTest(unittest.TestCase):
        data = [ { 'id': d['id'], 'attr1': d['value'], 'attr2': 1, 'neighbors': d['neighbors'] } for d in self.neighbors_data]
        plpy._define_result('select', data)
        random_seeds.set_random_seeds(1234)
-        result = cc.moran_local_rate('subquery', 'numerator', 'denominator', 'knn', 5, 99, 'the_geom', 'cartodb_id')
+        result = cc.moran_local_rate('subquery', 'numerator', 'denominator', 99, 'the_geom', 'cartodb_id', 'knn', 5)
        print 'result == None? ', result == None
        result = [(row[0], row[1]) for row in result]
        expected = self.moran_data
@@ -76,7 +76,7 @@ class MoranTest(unittest.TestCase):
        data = [{ 'id': d['id'], 'attr1': d['value'], 'neighbors': d['neighbors'] } for d in self.neighbors_data]
        plpy._define_result('select', data)
        random_seeds.set_random_seeds(1235)
-        result = cc.moran('table', 'value', 'knn', 5, 99, 'the_geom', 'cartodb_id')
+        result = cc.moran('table', 'value', 99, 'the_geom', 'cartodb_id', 'knn', 5)
        print 'result == None?', result == None
        result_moran = result[0][0]
        expected_moran = np.array([row[0] for row in self.moran_data]).mean()
Author	SHA1	Message	Date
Ubuntu	97b4949f84	performance improvements	2016-05-27 19:31:37 +00:00
Ubuntu	df09d03de6	adding sklearn to deps	2016-05-27 14:59:24 +00:00
Ubuntu	b3c55614e3	fixing syntax	2016-05-27 14:58:43 +00:00
Ubuntu	1ddc338f3f	adding missing ;	2016-05-27 14:58:05 +00:00
Stuart Lynn	d7424b02e5	adding import to crankshaft __init__	2016-05-27 10:33:00 -04:00
Stuart Lynn	45705f3a16	adding function preflight	2016-05-27 10:29:47 -04:00
Stuart Lynn	1995721921	adding functions to drop columns which are all nan and fill nan values with the mean of those columns	2016-05-27 10:29:15 -04:00
Ubuntu	4630d6b549	debugging	2016-05-26 19:32:49 +00:00
Stuart Lynn	0fca6c3c1a	initial commit of similarity functions	2016-05-26 12:31:58 -04:00