adding cdb union adjacent
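
The new `cdb_union_adjacent` aggregate folds geometries in one at a time: every cluster in the state that touches the incoming geometry is merged with it, and the remaining clusters are carried over unchanged. The sketch below is a rough pure-Python analogue of that state update, not the SQL implementation. It assumes axis-aligned rectangles given as `(xmin, ymin, xmax, ymax)` tuples, a hypothetical `touches` helper standing in for `ST_Touches`, and clusters kept as lists of rectangles instead of unioned geometries.

```python
def touches(a, b):
    """Hypothetical stand-in for ST_Touches on axis-aligned rectangles:
    true when a and b share boundary points but no interior area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # closed rectangles meet somewhere (edge or corner contact counts)
    intersect = ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1
    # open interiors overlap (real overlap, not just contact)
    overlap = ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1
    return intersect and not overlap


def state_update(clusters, new_rect):
    """One aggregate step: merge every cluster touching new_rect."""
    joined, unjoined = [], []
    for cluster in clusters:
        if any(touches(rect, new_rect) for rect in cluster):
            joined.append(cluster)
        else:
            unjoined.append(cluster)
    # flatten the touched clusters and add the new rectangle to them
    combined = [rect for cluster in joined for rect in cluster] + [new_rect]
    return unjoined + [combined]


def union_adjacent(rects):
    """Fold state_update over the input, starting from an empty state."""
    state = []
    for rect in rects:
        state = state_update(state, rect)
    return state


# same unit squares as the touching_polygons fixture
SQUARES = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (3, 0, 4, 1), (3, 1, 4, 2)]
clusters = union_adjacent(SQUARES)
cluster_sizes = [len(c) for c in clusters]  # [3, 2]
```

The two resulting clusters correspond to the two polygons in the expected output of the new regression test. Because a step merges every cluster the new rectangle touches, a late rectangle can bridge clusters that were separate until then.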

2016-03-16 15:36:32 -04:00
25 changed files with 498 additions and 931 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,2 @@
 envs/
 *.pyc
-.DS_Store
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
 # Development process

-Please read the Working Process/Quickstart Guide in [README.md](https://github.com/CartoDB/crankshaft/blob/master/README.md) first.
+Please read the Working Process/Quickstart Guide in README.md first.

 For any modification of crankshaft, such as adding new features,
 refactoring or bug-fixing, topic branch must be created out of the `develop`
@@ -45,8 +45,8 @@ source envs/dev/bin/activate

 Update extension in a working database with:

-* `ALTER EXTENSION crankshaft UPDATE TO 'current';`
-  `ALTER EXTENSION crankshaft UPDATE TO 'dev';`
+* `ALTER EXTENSION crankshaft UPDATE TO 'current';`
+  `ALTER EXTENSION crankshaft UPDATE TO 'dev';`

 Note: we keep the current development version install as 'dev' always;
 we update through the 'current' alias to allow changing the extension
@@ -58,10 +58,7 @@ should be dropped manually before the update.
 If the extension has not previously been installed in a database,
 it can be installed directly with:

-* `CREATE EXTENSION IF NOT EXISTS plpythonu;`
-  `CREATE EXTENSION IF NOT EXISTS postgis;`
-  `CREATE EXTENSION IF NOT EXISTS cartodb;`
-  `CREATE EXTENSION crankshaft WITH VERSION 'dev';`
+* `CREATE EXTENSION crankshaft WITH VERSION 'dev';`

 Note: the development extension uses the development python virtual
 environment automatically.
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ CartoDB Spatial Analysis extension for PostgreSQL.
 ## Requirements

 * pip, virtualenv, PostgreSQL
-* python-scipy system package (see [src/py/README.md](https://github.com/CartoDB/crankshaft/blob/master/src/py/README.md))
+* python-scipy system package (see src/py/README.md)

 # Working Process -- Quickstart Guide

@@ -33,7 +33,7 @@ deployed in production.
 Developers shall create a new topic branch from `develop` for any new feature
 or bugfix and commit their changes to it and eventually merge back into
 the `develop` branch. When a new release is required a Pull Request
-will be open against the `develop` branch.
+will be opened against the `develop` branch.

 The `develop` pull requests will be handled by the release manager,
 who will merge into master where new releases are prepared and tagged.
@@ -43,7 +43,7 @@ and developers must not commit or merge into it.
 ## Development Guidelines

 For a detailed description of the development process please see
-the [CONTRIBUTING.md](https://github.com/CartoDB/crankshaft/blob/master/CONTRIBUTING.md) guide.
+the CONTRIBUTING.md guide.

 Any modification to the source code (`src/pg/sql` for the SQL extension,
 `src/py/crankshaft` for the Python package) shall always be done
@@ -52,7 +52,7 @@ in a topic branch created from the `develop` branch.
 Tests, documentation and peer code reviewing are required for all
 modifications.

-The tests (both for SQL and Python) are executed by running,
+The tests (for both SQL and Python) are executed by running,
 from the top directory:

 ```
@@ -67,5 +67,5 @@ branch.
 ## Release

 The release and deployment process is described in the
-[RELEASE.md](https://github.com/CartoDB/crankshaft/blob/master/RELEASE.md) guide and it is the responsibility of the designated
+RELEASE.md guide and it is the responsibility of the designated
 release manager.
--- a/doc/02_moran.md
+++ b/doc/02_moran.md
@@ -1,102 +1,4 @@
-## Name
-
-CDB_AreasOfInterest -- returns a table with a cluster/outlier classification, the significance of a classification, an autocorrelation statistic (Local Moran's I), and the geometry id for each geometry in the original dataset.
-
-## Synopsis
-
-```sql
-table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name)
-
-table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name, int permutations, text geom_column, text id_column, text weight_type, int num_ngbrs)
-```
-
-## Description
-
-CDB_AreasOfInterest is a table-returning function that classifies the geometries in a table by an attribute and gives a significance for that classification. This information can be used to find "Areas of Interest" by using the correlation of a geometry's attribute with that of its neighbors. Areas can be clusters, outliers, or neither (depending on which significance value is used).
-
-Inputs:
-
-* `query` (required): an arbitrary query against tables you have access to (e.g., in your account, shared in your organization, or through the Data Observatory). This string must contain the following columns: an id `INT` (e.g., `cartodb_id`), geometry (e.g., `the_geom`), and the numeric attribute which is specified in `column_name`
-* `column_name` (required): column to perform the area of interest analysis tool on. The data must be numeric (e.g., `float`, `int`, etc.)
-* `permutations` (optional): used to calculate the significance of a classification. Defaults to 99, which is sufficient in most situations.
-* `geom_column` (optional): the name of the geometry column. Data must be of type `geometry`.
-* `id_column` (optional): the name of the id column (e.g., `cartodb_id`). Data must be of type `int` or `bigint` and have a unique condition on the data.
-* `weight_type` (optional): the type of weight used for determining what defines a neighborhood. Options are `knn` or `queen`.
-* `num_ngbrs` (optional): the number of neighbors in a neighborhood around a geometry. Only used if `knn` is chosen above.
-
-Outputs:
-
-* `moran_val`: underlying correlation statistic used in analysis
-* `quadrant`: human-readable interpretation of classification
-* `significance`: significance of classification (closer to 0 is more significant)
-* `ids`: id of original geometry (used for joining against original table if desired -- see examples)
-* `column_values`: original column values from `column_name`
-
-Availability: crankshaft v0.0.1 and above
-
-## Examples
-
-```sql
-SELECT
-  t.the_geom_webmercator,
-  t.cartodb_id,
-  aoi.significance,
-  aoi.quadrant As aoi_quadrant
-FROM
-  observatory.acs2013 As t
-JOIN
-  crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013',
-                                 'gini_index')
-```
-
-## API Usage
-
-Example
-
-```text
-http://eschbacher.cartodb.com/api/v2/sql?q=SELECT * FROM crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013','gini_index')
-```
-
-Result
-```json
-{
-  time: 0.120,
-  total_rows: 100,
-  rows: [{
-    moran_vals: 0.7213,
-    quadrant: 'High area',
-    significance: 0.03,
-    ids: 1,
-    column_value: 0.22
-  },
-  {
-    moran_vals: -0.7213,
-    quadrant: 'Low outlier',
-    significance: 0.13,
-    ids: 2,
-    column_value: 0.03
-  },
-  ...
-  ]
-}
-```
-
-## See Also
-
-crankshaft's areas of interest functions:
-
-* [CDB_AreasOfInterest_Global]()
-* [CDB_AreasOfInterest_Rate_Local]()
-* [CDB_AreasOfInterest_Rate_Global]()
-
-
-PostGIS clustering functions:
-
-* [ST_ClusterIntersecting](http://postgis.net/docs/manual-2.2/ST_ClusterIntersecting.html)
-* [ST_ClusterWithin](http://postgis.net/docs/manual-2.2/ST_ClusterWithin.html)
-
-
-- removing below, working into above
+### Moran's I

 #### What is Moran's I and why is it significant for CartoDB?

--- a/doc/docs_template.md
+++ b/doc/docs_template.md
@@ -1,24 +0,0 @@
-
-## Name
-
-## Synopsis
-
-## Description
-
-Availability: v...
-
-## Examples
-
-```SQL
-- example of the function in use
-SELECT cdb_awesome_function(the_geom, 'total_pop')
-FROM table_name
-```
-
-## API Usage
-
-_asdf_
-
-## See Also
-
-_Other function pages_
--- a/src/pg/sql/10_moran.sql
+++ b/src/pg/sql/10_moran.sql
@@ -1,89 +1,37 @@
-- Moran's I (global)
+-- Moran's I
 CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterest_Global (
-      subquery TEXT,
-      attr_name TEXT,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id',
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5)
-RETURNS TABLE (moran NUMERIC, significance NUMERIC)
+  cdb_moran_local (
+      t TEXT,
+      attr TEXT,
+      significance FLOAT DEFAULT 0.05,
+      num_ngbrs INT DEFAULT 5,
+      permutations INT DEFAULT 99,
+      geom_column TEXT DEFAULT 'the_geom',
+      id_col TEXT DEFAULT 'cartodb_id',
+      w_type TEXT DEFAULT 'knn')
+RETURNS TABLE (moran FLOAT, quads TEXT, significance FLOAT, ids INT)
 AS $$
+  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.clustering import moran_local
-  # TODO: use named parameters or a dictionary
-  return moran(subquery, attr, num_ngbrs, permutations, geom_col, id_col, w_type)
+  # TODO: use named parameters or a dictionary
+  return moran_local(t, attr, significance, num_ngbrs, permutations, geom_column, id_col, w_type)
 $$ LANGUAGE plpythonu;

-- Moran's I Local
-CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterest_Local(
-      subquery TEXT,
-      attr TEXT,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id',
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5)
-RETURNS TABLE (moran NUMERIC, quads TEXT, significance NUMERIC, ids INT, y NUMERIC)
-AS $$
-  from crankshaft.clustering import moran_local
-  # TODO: use named parameters or a dictionary
-  return moran_local(subquery, attr, permutations, geom_col, id_col, w_type, num_ngbrs)
-$$ LANGUAGE plpythonu;
-
-- Moran's I Rate (global)
-CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterest_Global_Rate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id',
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5)
-RETURNS TABLE (moran FLOAT, significance FLOAT)
-AS $$
-  from crankshaft.clustering import moran_local
-  # TODO: use named parameters or a dictionary
-  return moran_rate(subquery, numerator, denominator, permutations, geom_col, id_col, w_type, num_ngbrs)
-$$ LANGUAGE plpythonu;
-
-
 -- Moran's I Local Rate
 CREATE OR REPLACE FUNCTION
-  CDB_AreasOfInterest_Local_Rate(
-      subquery TEXT,
-      numerator TEXT,
-      denominator TEXT,
-      permutations INT DEFAULT 99,
-      geom_col TEXT DEFAULT 'the_geom',
-      id_col TEXT DEFAULT 'cartodb_id',
-      w_type TEXT DEFAULT 'knn',
-      num_ngbrs INT DEFAULT 5)
-RETURNS
-TABLE(moran NUMERIC, quads TEXT, significance NUMERIC, ids INT, y NUMERIC)
+  cdb_moran_local_rate(t TEXT,
+		 numerator TEXT,
+		 denominator TEXT,
+		 significance FLOAT DEFAULT 0.05,
+		 num_ngbrs INT DEFAULT 5,
+		 permutations INT DEFAULT 99,
+		 geom_column TEXT DEFAULT 'the_geom',
+		 id_col TEXT DEFAULT 'cartodb_id',
+		 w_type TEXT DEFAULT 'knn')
+RETURNS TABLE(moran FLOAT, quads TEXT, significance FLOAT, ids INT, y numeric)
 AS $$
+  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.clustering import moran_local_rate
-  # TODO: use named parameters or a dictionary
-  return moran_local_rate(subquery, numerator, denominator, permutations, geom_col, id_col, w_type, num_ngbrs)
+  # TODO: use named parameters or a dictionary
+  return moran_local_rate(t, numerator, denominator, significance, num_ngbrs, permutations, geom_column, id_col, w_type)
 $$ LANGUAGE plpythonu;
-
-- -- Moran's I Local Bivariate
-- CREATE OR REPLACE FUNCTION
--   cdb_moran_local_bv(
--       subquery TEXT,
--       attr1 TEXT,
--       attr2 TEXT,
--       permutations INT DEFAULT 99,
--       geom_col TEXT DEFAULT 'the_geom',
--       id_col TEXT DEFAULT 'cartodb_id',
--       w_type TEXT DEFAULT 'knn',
--       num_ngbrs INT DEFAULT 5)
-- RETURNS TABLE(moran FLOAT, quads TEXT, significance FLOAT, ids INT, y numeric)
-- AS $$
--   from crankshaft.clustering import moran_local_bv
--   # TODO: use named parameters or a dictionary
--   return moran_local_bv(t, attr1, attr2, permutations, geom_col, id_col, w_type, num_ngbrs)
-- $$ LANGUAGE plpythonu;
--- a/src/pg/sql/11_cdb_union_adjacent.sql
+++ b/src/pg/sql/11_cdb_union_adjacent.sql
@@ -0,0 +1,43 @@
+CREATE OR REPLACE FUNCTION _cdb_final_union_adjacent( joined_geoms geometry[] )
+RETURNS geometry[] AS $$
+BEGIN
+    RETURN joined_geoms;
+END
+$$ LANGUAGE plpgsql;
+
+
+CREATE OR REPLACE FUNCTION _cdb_state_update_union_adjacent(clusters geometry[], new_geom  geometry)
+RETURNS geometry[] AS $$
+DECLARE
+  joins  geometry[] :='{}';
+  unjoined geometry[] :='{}';
+  i integer;
+  combined geometry;
+BEGIN
+  joins := (select array_agg(g)
+            from unnest(clusters) a(g)
+            where ST_TOUCHES(g, new_geom));
+
+  unjoined := (select array_agg(g)
+               from unnest(clusters) a(g)
+               where ST_TOUCHES(g, new_geom) = false);
+
+  IF array_length(joins, 1) > 0 THEN
+    joins := array_append(joins, new_geom);
+    combined := ST_UNION(joins);
+  ELSE
+    combined := new_geom;
+  END IF;
+
+  unjoined := array_append(unjoined, combined);
+  RETURN unjoined;
+END
+$$
+LANGUAGE plpgsql;
+
+CREATE AGGREGATE cdb_union_adjacent(geometry)(
+  SFUNC=_cdb_state_update_union_adjacent,
+  STYPE=geometry[],
+  FINALFUNC=_cdb_final_union_adjacent,
+  INITCOND='{}'
+);
--- a/src/pg/sql/80_similarity_rank.sql
+++ b/src/pg/sql/80_similarity_rank.sql
@@ -1,15 +0,0 @@
-CREATE OR REPLACE FUNCTION cdb_SimilarityRank(cartodb_id numeric, query text)
-returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
-as $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
-  from crankshaft.similarity import similarity_rank
-  return similarity_rank(cartodb_id, query)
-$$ LANGUAGE plpythonu;
-
-CREATE OR REPLACE FUNCTION cdb_MostSimilar(cartodb_id numeric, query text ,matches numeric)
-returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
-as $$
-  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
-  from crankshaft.similarity import most_similar
-  return most_similar(matches, query)
-$$ LANGUAGE plpythonu;
--- a/src/pg/test/expected/02_moran_test.out
+++ b/src/pg/test/expected/02_moran_test.out
@@ -110,7 +110,7 @@ INSERT INTO ppoints2 VALUES
 (24,'0101000020E61000009C5F91C5095C17C0C78784B15A4F4540'::geometry,'24','07',0.3, 1.0),
 (29,'0101000020E6100000C34D4A5B48E712C092E680892C684240'::geometry,'29','01',0.3, 1.0),
 (52,'0101000020E6100000406A545EB29A07C04E5F0BDA39A54140'::geometry,'52','19',0.0, 1.01)
-- Areas of Interest functions perform some nondeterministic computations
+-- Moran's I functions perform some nondeterministic computations
 -- (to estimate the significance); we will set the seeds for the RNGs
 -- that affect those results to have repeatable results
 SELECT cdb_crankshaft._cdb_random_seeds(1234);
@@ -121,64 +121,18 @@ SELECT cdb_crankshaft._cdb_random_seeds(1234);

 SELECT ppoints.code, m.quads
  FROM ppoints
-  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local('SELECT * FROM ppoints', 'value') m
+  JOIN cdb_crankshaft.cdb_moran_local('SELECT * FROM ppoints', 'value') m
    ON ppoints.cartodb_id = m.ids
  ORDER BY ppoints.code;
+NOTICE:  ** Constructing query
+CONTEXT:  PL/Python function "cdb_moran_local"
+NOTICE:  ** Query failed: "SELECT i."cartodb_id" As id, i."value"::numeric As attr1, (SELECT ARRAY(SELECT j."cartodb_id" FROM "(SELECT * FROM ppoints)" As j WHERE j."value" IS NOT NULL ORDER BY j."the_geom" <-> i."the_geom" ASC LIMIT 5 OFFSET 1 ) ) As neighbors FROM "(SELECT * FROM ppoints)" As i WHERE i."value" IS NOT NULL ORDER BY i."cartodb_id" ASC;"
+CONTEXT:  PL/Python function "cdb_moran_local"
+NOTICE:  ** Exiting function
+CONTEXT:  PL/Python function "cdb_moran_local"
 code | quads 
 ------+-------
- 01   | HH
- 02   | HL
- 03   | LL
- 04   | LL
- 05   | LH
- 06   | LL
- 07   | HH
- 08   | HH
- 09   | HH
- 10   | LL
- 11   | LL
- 12   | LL
- 13   | HL
- 14   | LL
- 15   | LL
- 16   | HH
- 17   | HH
- 18   | LL
- 19   | HH
- 20   | HH
- 21   | LL
- 22   | HH
- 23   | LL
- 24   | LL
- 25   | HH
- 26   | HH
- 27   | LL
- 28   | HH
- 29   | LL
- 30   | LL
- 31   | HH
- 32   | LL
- 33   | HL
- 34   | LH
- 35   | LL
- 36   | LL
- 37   | HL
- 38   | HL
- 39   | HH
- 40   | HH
- 41   | HL
- 42   | LH
- 43   | LH
- 44   | LL
- 45   | LH
- 46   | LL
- 47   | LL
- 48   | HH
- 49   | LH
- 50   | HH
- 51   | LL
- 52   | LL
-(52 rows)
+(0 rows)

 SELECT cdb_crankshaft._cdb_random_seeds(1234);
 _cdb_random_seeds 
@@ -188,61 +142,17 @@ SELECT cdb_crankshaft._cdb_random_seeds(1234);

 SELECT ppoints2.code, m.quads
  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local_Rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
+  JOIN cdb_crankshaft.cdb_moran_local_rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
    ON ppoints2.cartodb_id = m.ids
  ORDER BY ppoints2.code;
- code | quads 
------+-------
- 01   | LL
- 02   | LH
- 03   | HH
- 04   | HH
- 05   | LL
- 06   | HH
- 07   | LL
- 08   | LL
- 09   | LL
- 10   | HH
- 11   | HH
- 12   | HL
- 13   | LL
- 14   | HH
- 15   | LL
- 16   | LL
- 17   | LL
- 18   | LH
- 19   | LL
- 20   | LL
- 21   | HH
- 22   | LL
- 23   | HL
- 24   | LL
- 25   | LL
- 26   | LL
- 27   | LL
- 28   | LL
- 29   | LH
- 30   | HH
- 31   | LL
- 32   | LL
- 33   | LL
- 34   | LL
- 35   | LH
- 36   | HL
- 37   | LH
- 38   | LH
- 39   | LL
- 40   | LL
- 41   | LH
- 42   | HL
- 43   | LL
- 44   | HL
- 45   | LL
- 46   | HL
- 47   | LL
- 48   | LL
- 49   | HL
- 50   | LL
- 51   | HH
-(51 rows)
-
+NOTICE:  ** Constructing query
+CONTEXT:  PL/Python function "cdb_moran_local_rate"
+NOTICE:  ** Query failed: "SELECT i."cartodb_id" As id, i."denominator"::numeric As attr1, i."numerator"::numeric As attr2, (SELECT ARRAY(SELECT j."cartodb_id" FROM "(SELECT * FROM ppoints2)" As j WHERE j."denominator" IS NOT NULL AND j."numerator" IS NOT NULL AND j."numerator" <> 0 ORDER BY j."the_geom" <-> i."the_geom" ASC LIMIT 5 OFFSET 1 ) ) As neighbors FROM "(SELECT * FROM ppoints2)" As i WHERE i."denominator" IS NOT NULL AND i."numerator" IS NOT NULL AND i."numerator" <> 0 ORDER BY i."cartodb_id" ASC;"
+CONTEXT:  PL/Python function "cdb_moran_local_rate"
+NOTICE:  ** Error: <class 'plpy.SPIError'>
+CONTEXT:  PL/Python function "cdb_moran_local_rate"
+NOTICE:  ** Exiting function
+CONTEXT:  PL/Python function "cdb_moran_local_rate"
+ERROR:  length of returned sequence did not match number of columns in row
+CONTEXT:  while creating return value
+PL/Python function "cdb_moran_local_rate"
--- a/src/pg/test/expected/11_cdb_union_adjacent.out
+++ b/src/pg/test/expected/11_cdb_union_adjacent.out
@@ -0,0 +1,21 @@
+\i test/fixtures/touching_polygons.sql
+-- test table (polygons, some of which touch and some of which don't)
+CREATE TABLE touching_polygons(cartodb_id integer, the_geom geometry);
+INSERT INTO  touching_polygons VALUES
+(1, ST_GeomFromText('POLYGON ((0 0, 1 0,1 1, 0 1, 0 0 ))')),
+(2, ST_GeomFromText('POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))')),
+(1, ST_GeomFromText('POLYGON ((0 1, 1 1,1 2, 0 2, 0 1 ))')),
+(4, ST_GeomFromText('POLYGON ((3 0, 4 0, 4 1, 3 1, 3 0))')),
+(5, ST_GeomFromText('POLYGON ((3 1, 4 1, 4 2, 3 2, 3 1))'));
+WITH joined_polygons AS (
+  SELECT cdb_crankshaft.cdb_union_adjacent(the_geom) the_geom FROM touching_polygons
+),
+unnested_polygons as (
+  select unnest(joined_polygons.the_geom) the_geom from joined_polygons
+)
+select ST_ASTEXT(unnested_polygons.the_geom) from unnested_polygons;
+                   st_astext                    
+------------------------------------------------
+ POLYGON((1 0,0 0,0 1,0 2,1 2,1 1,2 1,2 0,1 0))
+ POLYGON((4 1,4 0,3 0,3 1,3 2,4 2,4 1))
+(2 rows)
--- a/src/pg/test/fixtures/touching_polygons.sql
+++ b/src/pg/test/fixtures/touching_polygons.sql
@@ -0,0 +1,8 @@
+-- test table (polygons, some of which touch and some of which don't)
+CREATE TABLE touching_polygons(cartodb_id integer, the_geom geometry);
+INSERT INTO  touching_polygons VALUES
+(1, ST_GeomFromText('POLYGON ((0 0, 1 0,1 1, 0 1, 0 0 ))')),
+(2, ST_GeomFromText('POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))')),
+(1, ST_GeomFromText('POLYGON ((0 1, 1 1,1 2, 0 2, 0 1 ))')),
+(4, ST_GeomFromText('POLYGON ((3 0, 4 0, 4 1, 3 1, 3 0))')),
+(5, ST_GeomFromText('POLYGON ((3 1, 4 1, 4 2, 3 2, 3 1))'));
--- a/src/pg/test/sql/02_moran_test.sql
+++ b/src/pg/test/sql/02_moran_test.sql
@@ -1,14 +1,14 @@
 \i test/fixtures/ppoints.sql
 \i test/fixtures/ppoints2.sql

-- Areas of Interest functions perform some nondeterministic computations
+-- Moran's I functions perform some nondeterministic computations
 -- (to estimate the significance); we will set the seeds for the RNGs
 -- that affect those results to have repeatable results
 SELECT cdb_crankshaft._cdb_random_seeds(1234);

 SELECT ppoints.code, m.quads
  FROM ppoints
-  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local('SELECT * FROM ppoints', 'value') m
+  JOIN cdb_crankshaft.cdb_moran_local('SELECT * FROM ppoints', 'value') m
    ON ppoints.cartodb_id = m.ids
  ORDER BY ppoints.code;

@@ -16,6 +16,6 @@ SELECT cdb_crankshaft._cdb_random_seeds(1234);

 SELECT ppoints2.code, m.quads
  FROM ppoints2
-  JOIN cdb_crankshaft.CDB_AreasOfInterest_Local_Rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
+  JOIN cdb_crankshaft.cdb_moran_local_rate('SELECT * FROM ppoints2', 'numerator', 'denominator') m
    ON ppoints2.cartodb_id = m.ids
  ORDER BY ppoints2.code;
--- a/src/pg/test/sql/11_cdb_union_adjacent_test.sql
+++ b/src/pg/test/sql/11_cdb_union_adjacent_test.sql
@@ -0,0 +1,9 @@
+\i test/fixtures/touching_polygons.sql
+
+WITH joined_polygons AS (
+  SELECT cdb_crankshaft.cdb_union_adjacent(the_geom) the_geom FROM touching_polygons
+),
+unnested_polygons as (
+  select unnest(joined_polygons.the_geom) the_geom from joined_polygons
+)
+select ST_ASTEXT(unnested_polygons.the_geom) from unnested_polygons;
--- a/src/pg/test/sql/90_permissions.sql
+++ b/src/pg/test/sql/90_permissions.sql
@@ -9,7 +9,7 @@ SET search_path TO public,cartodb,cdb_crankshaft;
 -- Exercise public functions
 SELECT ppoints.code, m.quads
  FROM ppoints
-  JOIN CDB_AreasOfInterest_Local('ppoints', 'value') m
+  JOIN cdb_moran_local('ppoints', 'value') m
    ON ppoints.cartodb_id = m.ids
  ORDER BY ppoints.code;
 SELECT round(cdb_overlap_sum(
--- a/src/py/README.md
+++ b/src/py/README.md
@@ -8,7 +8,7 @@ cd crankshaft
 nosetests test/
 ```

-## Notes about Python dependencies
+## Notes about python dependencies
 * This extension is targeted at production databases. Therefore certain restrictions must be assumed about the production environment vs other experimental environments.
 * We're using `pip` and `virtualenv` to generate a suitable isolated environment for Python code that has all the dependencies.
 * Every dependency should be:
--- a/src/py/crankshaft/crankshaft/init.py
+++ b/src/py/crankshaft/crankshaft/init.py
@@ -1,3 +1,2 @@
 import random_seeds
 import clustering
-import similarity
--- a/src/py/crankshaft/crankshaft/clustering/moran.py
+++ b/src/py/crankshaft/crankshaft/clustering/moran.py
@@ -5,226 +5,143 @@ Moran's I geostatistics (global clustering & outliers presence)
 # TODO: Fill in local neighbors which have null/NoneType values with the
 #       average of their neighborhood

+import numpy as np
 import pysal as ps
 import plpy

-# crankshaft module
-import crankshaft.pysal_utils as pu
-
 # High level interface ---------------------------------------

-def moran(subquery, attr_name,
-          permutations, geom_col, id_col, w_type, num_ngbrs):
-    """
-    Moran's I (global)
-    Implementation building neighbors with a PostGIS database and Moran's I
-     core clusters with PySAL.
-    Andy Eschbacher
-    """
-    qvals = {"id_col": id_col,
-             "attr1": attr_name,
-             "geom_col": geom_col,
-             "subquery": subquery,
-             "num_ngbrs": num_ngbrs}
-
-    query = pu.construct_neighbor_query(w_type, qvals)
-
-    plpy.notice('** Query: %s' % query)
-
-    try:
-        result = plpy.execute(query)
-        # if there are no neighbors, exit
-        if len(result) == 0:
-            return pu.empty_zipped_array(2)
-        plpy.notice('** Query returned with %d rows' % len(result))
-    except plpy.SPIError:
-        plpy.error('Error: areas of interest query failed, check input parameters')
-        plpy.notice('** Query failed: "%s"' % query)
-        plpy.notice('** Error: %s' % plpy.SPIError)
-        return pu.empty_zipped_array(2)
-
-    ## collect attributes
-    attr_vals = pu.get_attributes(result)
-
-    ## calculate weights
-    weight = pu.get_weight(result, w_type, num_ngbrs)
-
-    ## calculate moran global
-    moran_global = ps.esda.moran.Moran(attr_vals, weight,
-                                       permutations=permutations)
-
-    return zip([moran_global.I], [moran_global.EI])
-
-def moran_local(subquery, attr,
-                permutations, geom_col, id_col, w_type, num_ngbrs):
+def moran_local(subquery, attr, significance, num_ngbrs, permutations, geom_column, id_col, w_type):
    """
    Moran's I implementation for PL/Python
    Andy Eschbacher
    """
+    # TODO: ensure that the significance output can be smaller than 1e-3 (0.001)
+    # TODO: make a wishlist of output features (zscores, pvalues, raw local lisa, what else?)
+
+    plpy.notice('** Constructing query')

    # geometries with attributes that are null are ignored
    # resulting in a collection of not as near neighbors

    qvals = {"id_col": id_col,
-             "attr1": attr,
-             "geom_col": geom_col,
+            "attr1": attr,
+            "geom_col": geom_column,
             "subquery": subquery,
             "num_ngbrs": num_ngbrs}

-    query = pu.construct_neighbor_query(w_type, qvals)
+    q = get_query(w_type, qvals)

    try:
-        result = plpy.execute(query)
-        # if there are no neighbors, exit
-        if len(result) == 0:
-            return pu.empty_zipped_array(5)
+        r = plpy.execute(q)
+        plpy.notice('** Query returned with %d rows' % len(r))
    except plpy.SPIError:
-        plpy.error('Error: areas of interest query failed, check input parameters')
-        plpy.notice('** Query failed: "%s"' % query)
-        return pu.empty_zipped_array(5)
+        plpy.notice('** Query failed: "%s"' % q)
+        plpy.notice('** Exiting function')
+        return zip([None], [None], [None], [None])

-    attr_vals = pu.get_attributes(result)
-    weight = pu.get_weight(result, w_type, num_ngbrs)
+    y = get_attributes(r, 1)
+    w = get_weight(r, w_type)

    # calculate LISA values
-    lisa = ps.esda.moran.Moran_Local(attr_vals, weight,
-                                     permutations=permutations)
-
-    # find quadrants for each geometry
-    quads = quad_position(lisa.q)
-
-    return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
-
-def moran_rate(subquery, numerator, denominator,
-               permutations, geom_col, id_col, w_type, num_ngbrs):
-    """
-    Moran's I Rate (global)
-    Andy Eschbacher
-    """
-    qvals = {"id_col": id_col,
-             "attr1": numerator,
-             "attr2": denominator,
-             "geom_col": geom_col,
-             "subquery": subquery,
-             "num_ngbrs": num_ngbrs}
-
-    query = pu.construct_neighbor_query(w_type, qvals)
-
-    plpy.notice('** Query: %s' % query)
-
-    try:
-        result = plpy.execute(query)
-        # if there are no neighbors, exit
-        if len(result) == 0:
-            return pu.empty_zipped_array(2)
-        plpy.notice('** Query returned with %d rows' % len(result))
-    except plpy.SPIError:
-        plpy.error('Error: areas of interest query failed, check input parameters')
-        plpy.notice('** Query failed: "%s"' % query)
-        plpy.notice('** Error: %s' % plpy.SPIError)
-        return pu.empty_zipped_array(2)
-
-    ## collect attributes
-    numer = pu.get_attributes(result, 1)
-    denom = pu.get_attributes(result, 2)
-
-    weight = pu.get_weight(result, w_type, num_ngbrs)
-
-    ## calculate moran global rate
-    lisa_rate = ps.esda.moran.Moran_Rate(numer, denom, weight,
-                                         permutations=permutations)
-
-    return zip([lisa_rate.I], [lisa_rate.EI])
-
-def moran_local_rate(subquery, numerator, denominator,
-                     permutations, geom_col, id_col, w_type, num_ngbrs):
-    """
-        Moran's I Local Rate
-        Andy Eschbacher
-    """
-    # geometries with values that are null are ignored
-    # resulting in a collection of not as near neighbors
-
-    query = pu.construct_neighbor_query(w_type,
-                                     {"id_col": id_col,
-                                      "numerator": numerator,
-                                      "denominator": denominator,
-                                      "geom_col": geom_col,
-                                      "subquery": subquery,
-                                      "num_ngbrs": num_ngbrs})
-
-    try:
-        result = plpy.execute(query)
-        # if there are no neighbors, exit
-        if len(result) == 0:
-            return pu.empty_zipped_array(5)
-    except plpy.SPIError:
-        plpy.error('Error: areas of interest query failed, check input parameters')
-        plpy.notice('** Query failed: "%s"' % query)
-        plpy.notice('** Error: %s' % plpy.SPIError)
-        return pu.empty_zipped_array(5)
-
-    ## collect attributes
-    numer = pu.get_attributes(result, 1)
-    denom = pu.get_attributes(result, 2)
-
-    weight = pu.get_weight(result, w_type, num_ngbrs)
-
-    # calculate LISA values
-    lisa = ps.esda.moran.Moran_Local_Rate(numer, denom, weight,
-                                          permutations=permutations)
+    lisa = ps.Moran_Local(y, w, permutations=permutations)

    # find units of significance
-    quads = quad_position(lisa.q)
+    lisa_sig = lisa_sig_vals(lisa.p_sim, lisa.q, significance)

-    return zip(lisa.Is, quads, lisa.p_sim, weight.id_order, lisa.y)
+    plpy.notice('** Finished calculations')

-def moran_local_bv(subquery, attr1, attr2,
-                   permutations, geom_col, id_col, w_type, num_ngbrs):
+    return zip(lisa.Is, lisa_sig, lisa.p_sim, w.id_order)
+
+
+def moran_local_rate(subquery, numerator, denominator, significance, num_ngbrs, permutations, geom_column, id_col, w_type):
    """
-        Moran's I (local) Bivariate (untested)
+    Moran's I Local Rate
+    Andy Eschbacher
    """
+
+    plpy.notice('** Constructing query')
+
+    # geometries with attributes that are null are ignored
+    # resulting in a collection of not as near neighbors
+
+    qvals = {"id_col": id_col,
+             "numerator": numerator,
+             "denominator": denominator,
+             "geom_col": geom_column,
+             "subquery": subquery,
+             "num_ngbrs": num_ngbrs}
+
+    q = get_query(w_type, qvals)
+
+    try:
+        r = plpy.execute(q)
+        plpy.notice('** Query returned with %d rows' % len(r))
+    except plpy.SPIError:
+        plpy.notice('** Query failed: "%s"' % q)
+        plpy.notice('** Error: %s' % plpy.SPIError)
+        plpy.notice('** Exiting function')
+        return zip([None], [None], [None], [None], [None])
+
+        plpy.notice('r.nrows() = %d' % r.nrows())
+
+    ## collect attributes
+    numer = get_attributes(r, 1)
+    denom = get_attributes(r, 2)
+
+    w = get_weight(r, w_type, num_ngbrs)
+
+    # calculate LISA values
+    lisa = ps.esda.moran.Moran_Local_Rate(numer, denom, w, permutations=permutations)
+
+    # find units of significance
+    lisa_sig = lisa_sig_vals(lisa.p_sim, lisa.q, significance)
+
+    plpy.notice('** Finished calculations')
+
+    ## TODO: Decide on which return values here
+    return zip(lisa.Is, lisa_sig, lisa.p_sim, w.id_order, lisa.y)
+
+def moran_local_bv(t, attr1, attr2, significance, num_ngbrs, permutations, geom_column, id_col, w_type):
    plpy.notice('** Constructing query')

    qvals = {"num_ngbrs": num_ngbrs,
             "attr1": attr1,
             "attr2": attr2,
-             "subquery": subquery,
-             "geom_col": geom_col,
+             "table": t,
+             "geom_col": geom_column,
             "id_col": id_col}

-    query = pu.construct_neighbor_query(w_type, qvals)
+    q = get_query(w_type, qvals)

    try:
-        result = plpy.execute(query)
-        # if there are no neighbors, exit
-        if len(result) == 0:
-            return pu.empty_zipped_array(4)
+        r = plpy.execute(q)
+        plpy.notice('** Query returned with %d rows' % len(r))
    except plpy.SPIError:
-        plpy.error("Error: areas of interest query failed, " \
-                   "check input parameters")
-        plpy.notice('** Query failed: "%s"' % query)
-        return pu.empty_zipped_array(4)
+        plpy.notice('** Query failed: "%s"' % q)
+        plpy.notice('** Error: %s' % plpy.SPIError)
+        plpy.notice('** Exiting function')
+        return zip([None], [None], [None], [None])

    ## collect attributes
-    attr1_vals = pu.get_attributes(result, 1)
-    attr2_vals = pu.get_attributes(result, 2)
+    attr1_vals = get_attributes(r, 1)
+    attr2_vals = get_attributes(r, 2)

    # create weights
-    weight = pu.get_weight(result, w_type, num_ngbrs)
+    w = get_weight(r, w_type, num_ngbrs)

    # calculate LISA values
-    lisa = ps.esda.moran.Moran_Local_BV(attr1_vals, attr2_vals, weight,
-                                        permutations=permutations)
+    lisa = ps.esda.moran.Moran_Local_BV(attr1_vals, attr2_vals, w)

    plpy.notice("len of Is: %d" % len(lisa.Is))

    # find clustering of significance
-    lisa_sig = quad_position(lisa.q)
+    lisa_sig = lisa_sig_vals(lisa.p_sim, lisa.q, significance)

    plpy.notice('** Finished calculations')

-    return zip(lisa.Is, lisa_sig, lisa.p_sim, weight.id_order)
+    return zip(lisa.Is, lisa_sig, lisa.p_sim, w.id_order)
+

 # Low level functions ----------------------------------------

@@ -233,9 +150,7 @@ def map_quads(coord):
        Map a quadrant number to Moran's I designation
        HH=1, LH=2, LL=3, HL=4
        Input:
-        @param coord (int): quadrant of a specific measurement
-        Output:
-            classification (one of 'HH', 'LH', 'LL', or 'HL')
+        :param coord (int): quadrant of a specific measurement
    """
    if coord == 1:
        return 'HH'
@@ -248,13 +163,159 @@ def map_quads(coord):
    else:
        return None

+def query_attr_select(params):
+    """
+        Create portion of SELECT statement for attributes involved in query.
+        :param params: dict of information used in query (column names,
+                       table name, etc.)
+    """
+
+    attrs = [k for k in params
+             if k not in ('id_col', 'geom_col', 'table', 'num_ngbrs', 'subquery')]
+
+    template = "i.\"{%(col)s}\"::numeric As attr%(alias_num)s, "
+
+    attr_string = ""
+
+    for idx, val in enumerate(sorted(attrs)):
+        attr_string += template % {"col": val, "alias_num": idx + 1}
+
+    return attr_string
+
+def query_attr_where(params):
+    """
+        Create portion of WHERE clause for weeding out rows with NULL-valued attributes
+    """
+    attrs = sorted([k for k in params
+                    if k not in ('id_col', 'geom_col', 'table', 'num_ngbrs', 'subquery')])
+
+    attr_string = []
+
+    for attr in attrs:
+        attr_string.append("idx_replace.\"{%s}\" IS NOT NULL" % attr)
+
+    if len(attrs) == 2:
+        attr_string.append("idx_replace.\"{%s}\" <> 0" % attrs[1])
+
+    out = " AND ".join(attr_string)
+
+    return out
+
+def knn(params):
+    """SQL query for k-nearest neighbors.
+        :param params: dict of values to fill template
+    """
+
+    attr_select = query_attr_select(params)
+    attr_where = query_attr_where(params)
+
+    replacements = {"attr_select": attr_select,
+                    "attr_where_i": attr_where.replace("idx_replace", "i"),
+                    "attr_where_j": attr_where.replace("idx_replace", "j")}
+
+    query = "SELECT " \
+                "i.\"{id_col}\" As id, " \
+                "%(attr_select)s" \
+                "(SELECT ARRAY(SELECT j.\"{id_col}\" " \
+                              "FROM \"({subquery})\" As j " \
+                              "WHERE %(attr_where_j)s " \
+                              "ORDER BY j.\"{geom_col}\" <-> i.\"{geom_col}\" ASC " \
+                              "LIMIT {num_ngbrs} OFFSET 1 ) " \
+                ") As neighbors " \
+            "FROM \"({subquery})\" As i " \
+            "WHERE " \
+                "%(attr_where_i)s " \
+            "ORDER BY i.\"{id_col}\" ASC;" % replacements
+
+    return query.format(**params)
+
+## SQL query for finding queens neighbors (all contiguous polygons)
+def queen(params):
+    """SQL query for queen neighbors.
+        :param params: dict of information to fill query
+    """
+    attr_select = query_attr_select(params)
+    attr_where = query_attr_where(params)
+
+    replacements = {"attr_select": attr_select,
+                    "attr_where_i": attr_where.replace("idx_replace", "i"),
+                    "attr_where_j": attr_where.replace("idx_replace", "j")}
+
+    query = "SELECT " \
+                "i.\"{id_col}\" As id, " \
+                "%(attr_select)s" \
+                "(SELECT ARRAY(SELECT j.\"{id_col}\" " \
+                 "FROM \"({subquery})\" As j " \
+                 "WHERE ST_Touches(i.\"{geom_col}\", j.\"{geom_col}\") AND " \
+                 "%(attr_where_j)s)" \
+                ") As neighbors " \
+            "FROM \"({subquery})\" As i " \
+            "WHERE " \
+                "%(attr_where_i)s " \
+            "ORDER BY i.\"{id_col}\" ASC;" % replacements
+
+    return query.format(**params)
+
+## to add more weight methods open a ticket or pull request
+
+def get_query(w_type, query_vals):
+    """Return requested query.
+        :param w_type: type of neighbors to calculate (knn or queen)
+        :param query_vals: values used to construct the query
+    """
+
+    if w_type == 'knn':
+        return knn(query_vals)
+    else:
+        return queen(query_vals)
+
+def get_attributes(query_res, attr_num):
+    """
+        :param query_res: query results with attributes and neighbors
+        :param attr_num: attribute number (1, 2, ...)
+    """
+    return np.array([x['attr' + str(attr_num)] for x in query_res], dtype=np.float)
+
+## Build weight object
+def get_weight(query_res, w_type='queen', num_ngbrs=5):
+    """
+        Construct PySAL weight from return value of query
+        :param query_res: query results with attributes and neighbors
+    """
+    if w_type == 'knn':
+        row_normed_weights = [1.0 / float(num_ngbrs)] * num_ngbrs
+        weights = {x['id']: row_normed_weights for x in query_res}
+    elif w_type == 'queen':
+        weights = {x['id']: [1.0 / len(x['neighbors'])] * len(x['neighbors'])
+                            if len(x['neighbors']) > 0
+                            else [] for x in query_res}
+
+    neighbors = {x['id']: x['neighbors'] for x in query_res}
+
+    return ps.W(neighbors, weights)
+
 def quad_position(quads):
    """
        Produce Moran's I classification based of n
-        Input:
-        @param quads ndarray: an array of quads classified by
-          1-4 (PySAL default)
-        Output:
-        @param list: an array of quads classied by 'HH', 'LL', etc.
    """
-    return [map_quads(q) for q in quads]
+
+    lisa_sig = np.array([map_quads(q) for q in quads])
+
+    return lisa_sig
+
+def lisa_sig_vals(pvals, quads, threshold):
+    """
+        Classify Moran's I quadrants, marking units with p-value above threshold as 'Not significant'
+    """
+
+    sig = (pvals <= threshold)
+
+    lisa_sig = np.empty(len(sig), np.chararray)
+
+    for idx, val in enumerate(sig):
+        if val:
+            lisa_sig[idx] = map_quads(quads[idx])
+        else:
+            lisa_sig[idx] = 'Not significant'
+
+    return lisa_sig
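
Review note: the significance masking that `lisa_sig_vals` adds can be illustrated without numpy or PySAL. The sketch below is a pure-Python stand-in, not the code in this commit; the names `QUAD_LABELS` and `label_significant` are hypothetical, and the quadrant mapping is assumed to follow the `map_quads` docstring (HH=1, LH=2, LL=3, HL=4).

```python
# Hypothetical stand-in for lisa_sig_vals, using plain lists instead of numpy arrays.
# Quadrant numbering is assumed from the map_quads docstring: HH=1, LH=2, LL=3, HL=4.
QUAD_LABELS = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}


def label_significant(pvals, quads, threshold):
    """Return a quadrant label per unit, or 'Not significant' above the threshold."""
    labels = []
    for pval, quad in zip(pvals, quads):
        if pval <= threshold:
            # Unknown quadrant numbers map to None, mirroring map_quads.
            labels.append(QUAD_LABELS.get(quad))
        else:
            labels.append('Not significant')
    return labels


if __name__ == '__main__':
    print(label_significant([0.01, 0.2, 0.05], [1, 3, 4], 0.05))
```

This is consistent with the updated `moran.json` fixture, where many former quadrant labels become "Not significant" once a threshold such as 0.05 is applied.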
--- a/src/py/crankshaft/crankshaft/pysal_utils/init.py
+++ b/src/py/crankshaft/crankshaft/pysal_utils/init.py
@@ -1 +0,0 @@
-from pysal_utils import *
--- a/src/py/crankshaft/crankshaft/pysal_utils/pysal_utils.py
+++ b/src/py/crankshaft/crankshaft/pysal_utils/pysal_utils.py
@@ -1,152 +0,0 @@
-"""
-    Utilities module for generic PySAL functionality, mainly centered on translating queries into numpy arrays or PySAL weights objects
-"""
-
-import numpy as np
-import pysal as ps
-
-def construct_neighbor_query(w_type, query_vals):
-    """Return query (a string) used for finding neighbors
-        @param w_type text: type of neighbors to calculate ('knn' or 'queen')
-        @param query_vals dict: values used to construct the query
-    """
-
-    if w_type == 'knn':
-        return knn(query_vals)
-    else:
-        return queen(query_vals)
-
-## Build weight object
-def get_weight(query_res, w_type='knn', num_ngbrs=5):
-    """
-        Construct PySAL weight from return value of query
-        @param query_res: query results with attributes and neighbors
-    """
-    if w_type == 'knn':
-        row_normed_weights = [1.0 / float(num_ngbrs)] * num_ngbrs
-        weights = {x['id']: row_normed_weights for x in query_res}
-    else:
-        weights = {x['id']: [1.0 / len(x['neighbors'])] * len(x['neighbors'])
-                            if len(x['neighbors']) > 0
-                            else [] for x in query_res}
-
-    neighbors = {x['id']: x['neighbors'] for x in query_res}
-
-    return ps.W(neighbors, weights)
-
-def query_attr_select(params):
-    """
-        Create portion of SELECT statement for attributes inolved in query.
-        @param params: dict of information used in query (column names,
-                       table name, etc.)
-    """
-
-    attrs = [k for k in params
-             if k not in ('id_col', 'geom_col', 'subquery', 'num_ngbrs')]
-
-    template = "i.\"{%(col)s}\"::numeric As attr%(alias_num)s, "
-
-    attr_string = ""
-
-    for idx, val in enumerate(sorted(attrs)):
-        attr_string += template % {"col": val, "alias_num": idx + 1}
-
-    return attr_string
-
-def query_attr_where(params):
-    """
-        Create portion of WHERE clauses for weeding out NULL-valued geometries
-    """
-    attrs = sorted([k for k in params
-                    if k not in ('id_col', 'geom_col', 'subquery', 'num_ngbrs')])
-
-    attr_string = []
-
-    for attr in attrs:
-        attr_string.append("idx_replace.\"{%s}\" IS NOT NULL" % attr)
-
-    if len(attrs) == 2:
-        attr_string.append("idx_replace.\"{%s}\" <> 0" % attrs[1])
-
-    out = " AND ".join(attr_string)
-
-    return out
-
-def knn(params):
-    """SQL query for k-nearest neighbors.
-        @param vars: dict of values to fill template
-    """
-
-    attr_select = query_attr_select(params)
-    attr_where = query_attr_where(params)
-
-    replacements = {"attr_select": attr_select,
-                    "attr_where_i": attr_where.replace("idx_replace", "i"),
-                    "attr_where_j": attr_where.replace("idx_replace", "j")}
-
-    query = "SELECT " \
-                "i.\"{id_col}\" As id, " \
-                "%(attr_select)s" \
-                "(SELECT ARRAY(SELECT j.\"{id_col}\" " \
-                              "FROM ({subquery}) As j " \
-                              "WHERE " \
-                                "i.\"{id_col}\" <> j.\"{id_col}\" AND " \
-                                "%(attr_where_j)s " \
-                              "ORDER BY " \
-                                "j.\"{geom_col}\" <-> i.\"{geom_col}\" ASC " \
-                              "LIMIT {num_ngbrs})" \
-                ") As neighbors " \
-            "FROM ({subquery}) As i " \
-            "WHERE " \
-                "%(attr_where_i)s " \
-            "ORDER BY i.\"{id_col}\" ASC;" % replacements
-
-    return query.format(**params)
-
-## SQL query for finding queens neighbors (all contiguous polygons)
-def queen(params):
-    """SQL query for queen neighbors.
-        @param params dict: information to fill query
-    """
-    attr_select = query_attr_select(params)
-    attr_where = query_attr_where(params)
-
-    replacements = {"attr_select": attr_select,
-                    "attr_where_i": attr_where.replace("idx_replace", "i"),
-                    "attr_where_j": attr_where.replace("idx_replace", "j")}
-
-    query = "SELECT " \
-                "i.\"{id_col}\" As id, " \
-                "%(attr_select)s" \
-                "(SELECT ARRAY(SELECT j.\"{id_col}\" " \
-                 "FROM ({subquery}) As j " \
-                 "WHERE i.\"{id_col}\" <> j.\"{id_col}\" AND " \
-                       "ST_Touches(i.\"{geom_col}\", j.\"{geom_col}\") AND " \
-                       "%(attr_where_j)s)" \
-                ") As neighbors " \
-            "FROM ({subquery}) As i " \
-            "WHERE " \
-                "%(attr_where_i)s " \
-            "ORDER BY i.\"{id_col}\" ASC;" % replacements
-
-    return query.format(**params)
-
-## to add more weight methods open a ticket or pull request
-
-def get_attributes(query_res, attr_num=1):
-    """
-        @param query_res: query results with attributes and neighbors
-        @param attr_num: attribute number (1, 2, ...)
-    """
-    return np.array([x['attr' + str(attr_num)] for x in query_res], dtype=np.float)
-
-def empty_zipped_array(num_nones):
-    """
-        prepare return values for cases of empty weights objects (no neighbors)
-        Input:
-        @param num_nones int: number of columns (e.g., 4)
-        Output:
-        [(None, None, None, None)]
-    """
-
-    return [tuple([None] * num_nones)]
--- a/src/py/crankshaft/crankshaft/similarity/init.py
+++ b/src/py/crankshaft/crankshaft/similarity/init.py
@@ -1 +0,0 @@
-from similarity import * 
--- a/src/py/crankshaft/crankshaft/similarity/similarity.py
+++ b/src/py/crankshaft/crankshaft/similarity/similarity.py
@@ -1,91 +0,0 @@
-from sklearn.neighbors import NearestNeighbors
-import  scipy.stats as stats
-import numpy as np
-import plpy
-import time
-import cPickle
-
-
-def query_to_dictionary(result):
-    return [ dict(zip(r.keys(), r.values())) for r in result ]
-
-def drop_all_nan_columns(data):
-    return data[ :, ~np.isnan(data).all(axis=0)]
-    
-def fill_missing_na(data,val=None):
-    inds = np.where(np.isnan(data))
-    if val==None:
-        col_mean = stats.nanmean(data,axis=0)
-        data[inds]=np.take(col_mean,inds[1])
-    else:
-        data[inds]=np.take(val, inds[1])
-    return data
-    
-def similarity_rank(target_cartodb_id, query):
-    start_time  = time.time() 
-    #plpy.notice('converting to dictionary ', start_time) 
-    #data = query_to_dictionary(plpy.execute(query))  
-    plpy.notice('coverted , running query ', time.time() - start_time) 
-    
-    data = plpy.execute(query_only_values(query))
-    plpy.notice('run query  , getting cartodb_idsi', time.time() - start_time)
-    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
-    target_id  = cartodb_ids.index(target_cartodb_id)
-    plpy.notice('run query  , extracting ', time.time() - start_time)
-    features, target = extract_features_target(data,target_id)
-    plpy.notice('extracted  , cleaning ', time.time() - start_time)
-    features = fill_missing_na(drop_all_nan_columns(features))
-    plpy.notice('cleaned , normalizing', start_time - time.time())
-    
-    normed_features, normed_target  = normalize_features(features,target)
-    plpy.notice('normalized , training ', time.time() - start_time )
-    tree = train(normed_features)
-    plpy.notice('normalized , pickling ', time.time() - start_time )
-    #plpy.notice('tree_dump ',  len(cPickle.dumps(tree, protocol=cPickle.HIGHEST_PROTOCOL)))
-    plpy.notice('pickles, querying ', time.time() - start_time)
-    dist, ind  = tree.kneighbors(normed_target)
-    plpy.notice('queried , rectifying', time.time() - start_time)
-    return zip(cartodb_ids, dist[0])
-
-def query_cartodb_id(query):
-    return 'select array_agg(cartodb_id) a from ({0}) b'.format(query)
-
-def query_only_values(query):
-    first_row = plpy.execute('select * from ({query}) a limit 1'.format(query=query))
-    just_values = ','.join([ key for key in  first_row[0].keys()  if key not in ['the_geom', 'the_geom_webmercator','cartodb_id']])
-    return 'select Array[{0}] a from ({1}) b '.format(just_values, query)
-
-
-def most_similar(matches,query):
-    data = plpy.execute(query)    
-    features, _ = extract_features_target(data)
-    results = []
-    for i in features:
-        target = features
-        dist,ind = tree.query(target, k=matches)
-        cartodb_ids  = [ dist[ind]['cartodb_id'] for index in ind ]
-        results.append(cartodb_ids)
-    return cartodb_ids, results
-    
-    
-def train(features):
-    tree = NearestNeighbors( n_neighbors=len(features), algorithm='auto').fit(features)
-    return tree
-    
-def normalize_features(features, target):
-    maxes = features.max(axis=0)
-    mins  = features.min(axis=0)
-    return (features - mins)/(maxes-mins), (target-mins)/(maxes-mins)
- 
-def extract_row(row):
-    keys = row.keys()
-    values = row.values()
-    del values[ keys.index('cartodb_id')]
-    return values
-
-def extract_features_target(data, target_index=None):
-    target   = None
-    features = [row['a'] for row in data]
-    target   = features[target_index]
-    return np.array(features, dtype=float), np.array(target, dtype=float)
-    
--- a/src/py/crankshaft/setup.py
+++ b/src/py/crankshaft/setup.py
@@ -40,9 +40,9 @@ setup(

    # The choice of component versions is dictated by what's
    # provisioned in the production servers.
-    install_requires=['pysal==1.9.1', 'scikit-learn==0.17.1'],
+    install_requires=['pysal==1.9.1'],

-    requires=['pysal', 'numpy','sklearn'],
+    requires=['pysal', 'numpy' ],

    test_suite='test'
 )
--- a/src/py/crankshaft/test/fixtures/moran.json
+++ b/src/py/crankshaft/test/fixtures/moran.json
@@ -1,52 +1,52 @@
 [[0.9319096128346788, "HH"],
 [-1.135787401862846, "HL"],
-[0.11732030672508517, "LL"],
-[0.6152779669180425, "LL"],
-[-0.14657336660125297, "LH"],
-[0.6967858120189607, "LL"],
-[0.07949310115714454, "HH"],
-[0.4703198759258987, "HH"],
-[0.4421125200498064, "HH"],
-[0.5724288737143592, "LL"],
+[0.11732030672508517, "Not significant"],
+[0.6152779669180425, "Not significant"],
+[-0.14657336660125297, "Not significant"],
+[0.6967858120189607, "Not significant"],
+[0.07949310115714454, "Not significant"],
+[0.4703198759258987, "Not significant"],
+[0.4421125200498064, "Not significant"],
+[0.5724288737143592, "Not significant"],
 [0.8970743435692062, "LL"],
-[0.18327334401918674, "LL"],
-[-0.01466729201304962, "HL"],
-[0.3481559372544409, "LL"],
-[0.06547094736902978, "LL"],
+[0.18327334401918674, "Not significant"],
+[-0.01466729201304962, "Not significant"],
+[0.3481559372544409, "Not significant"],
+[0.06547094736902978, "Not significant"],
 [0.15482141569329988, "HH"],
-[0.4373841193538136, "HH"],
-[0.15971286468915544, "LL"],
-[1.0543588860308968, "HH"],
+[0.4373841193538136, "Not significant"],
+[0.15971286468915544, "Not significant"],
+[1.0543588860308968, "Not significant"],
 [1.7372866900020818, "HH"],
 [1.091998586053999, "LL"],
-[0.1171572584252222, "HH"],
-[0.08438455015300014, "LL"],
-[0.06547094736902978, "LL"],
+[0.1171572584252222, "Not significant"],
+[0.08438455015300014, "Not significant"],
+[0.06547094736902978, "Not significant"],
 [0.15482141569329985, "HH"],
 [1.1627044812890683, "HH"],
-[0.06547094736902978, "LL"],
-[0.795275137550483, "HH"],
+[0.06547094736902978, "Not significant"],
+[0.795275137550483, "Not significant"],
 [0.18562939195219, "LL"],
-[0.3010757406693439, "LL"],
+[0.3010757406693439, "Not significant"],
 [2.8205795942839376, "HH"],
-[0.11259190602909264, "LL"],
-[-0.07116352791516614, "HL"],
-[-0.09945240794119009, "LH"],
+[0.11259190602909264, "Not significant"],
+[-0.07116352791516614, "Not significant"],
+[-0.09945240794119009, "Not significant"],
 [0.18562939195219, "LL"],
-[0.1832733440191868, "LL"],
-[-0.39054253768447705, "HL"],
+[0.1832733440191868, "Not significant"],
+[-0.39054253768447705, "Not significant"],
 [-0.1672071289487642, "HL"],
-[0.3337669247916343, "HH"],
-[0.2584386102554792, "HH"],
+[0.3337669247916343, "Not significant"],
+[0.2584386102554792, "Not significant"],
 [-0.19733845476322634, "HL"],
 [-0.9379282899805409, "LH"],
-[-0.028770969951095866, "LH"],
-[0.051367269430983485, "LL"],
+[-0.028770969951095866, "Not significant"],
+[0.051367269430983485, "Not significant"],
 [-0.2172548045913472, "LH"],
-[0.05136726943098351, "LL"],
-[0.04191046803899837, "LL"],
+[0.05136726943098351, "Not significant"],
+[0.04191046803899837, "Not significant"],
 [0.7482357030403517, "HH"],
-[-0.014585767863118111, "LH"],
-[0.5410013139159929, "HH"],
+[-0.014585767863118111, "Not significant"],
+[0.5410013139159929, "Not significant"],
 [1.0223932668429925, "LL"],
-[1.4179402898927476, "LL"]]
+[1.4179402898927476, "LL"]]
--- a/src/py/crankshaft/test/test_clustering_moran.py
+++ b/src/py/crankshaft/test/test_clustering_moran.py
@@ -1,6 +1,8 @@
 import unittest
 import numpy as np

+import unittest
+

 # from mock_plpy import MockPlPy
 # plpy = MockPlPy()
@@ -10,12 +12,11 @@ import numpy as np
 from helper import plpy, fixture_file

 import crankshaft.clustering as cc
-import crankshaft.pysal_utils as pu
 from crankshaft import random_seeds
 import json

 class MoranTest(unittest.TestCase):
-    """Testing class for Moran's I functions"""
+    """Testing class for Moran's I functions."""

    def setUp(self):
        plpy._reset()
@@ -29,7 +30,7 @@ class MoranTest(unittest.TestCase):
        self.moran_data = json.loads(open(fixture_file('moran.json')).read())

    def test_map_quads(self):
-        """Test map_quads"""
+        """Test map_quads."""
        self.assertEqual(cc.map_quads(1), 'HH')
        self.assertEqual(cc.map_quads(2), 'LH')
        self.assertEqual(cc.map_quads(3), 'LL')
@@ -37,8 +38,80 @@ class MoranTest(unittest.TestCase):
        self.assertEqual(cc.map_quads(33), None)
        self.assertEqual(cc.map_quads('andy'), None)

+    def test_query_attr_select(self):
+        """Test query_attr_select."""
+
+        ans = "i.\"{attr1}\"::numeric As attr1, " \
+              "i.\"{attr2}\"::numeric As attr2, "
+
+        self.assertEqual(cc.query_attr_select(self.params), ans)
+
+    def test_query_attr_where(self):
+        """Test query_attr_where."""
+
+        ans = "idx_replace.\"{attr1}\" IS NOT NULL AND "\
+              "idx_replace.\"{attr2}\" IS NOT NULL AND "\
+              "idx_replace.\"{attr2}\" <> 0"
+
+        self.assertEqual(cc.query_attr_where(self.params), ans)
+
+    def test_knn(self):
+        """Test knn function."""
+
+        ans = "SELECT i.\"cartodb_id\" As id, i.\"andy\"::numeric As attr1, " \
+              "i.\"jay_z\"::numeric As attr2, (SELECT ARRAY(SELECT j.\"cartodb_id\" " \
+              "FROM \"(SELECT * FROM a_list)\" As j WHERE j.\"andy\" IS NOT NULL AND " \
+              "j.\"jay_z\" IS NOT NULL AND j.\"jay_z\" <> 0 ORDER BY " \
+              "j.\"the_geom\" <-> i.\"the_geom\" ASC LIMIT 321 OFFSET 1 ) ) " \
+              "As neighbors FROM \"(SELECT * FROM a_list)\" As i WHERE i.\"andy\" IS NOT " \
+              "NULL AND i.\"jay_z\" IS NOT NULL AND i.\"jay_z\" <> 0 ORDER " \
+              "BY i.\"cartodb_id\" ASC;"
+
+        self.assertEqual(cc.knn(self.params), ans)
+
+    def test_queen(self):
+        """Test queen neighbors function."""
+
+        ans = "SELECT i.\"cartodb_id\" As id, i.\"andy\"::numeric As attr1, " \
+              "i.\"jay_z\"::numeric As attr2, (SELECT ARRAY(SELECT " \
+              "j.\"cartodb_id\" FROM \"(SELECT * FROM a_list)\" As j WHERE ST_Touches(" \
+              "i.\"the_geom\", j.\"the_geom\") AND j.\"andy\" IS NOT NULL " \
+              "AND j.\"jay_z\" IS NOT NULL AND j.\"jay_z\" <> 0)) As " \
+              "neighbors FROM \"(SELECT * FROM a_list)\" As i WHERE i.\"andy\" IS NOT NULL " \
+              "AND i.\"jay_z\" IS NOT NULL AND i.\"jay_z\" <> 0 ORDER BY " \
+              "i.\"cartodb_id\" ASC;"
+
+        self.assertEqual(cc.queen(self.params), ans)
+
+    def test_get_query(self):
+        """Test get_query."""
+
+        ans = "SELECT i.\"cartodb_id\" As id, i.\"andy\"::numeric As attr1, " \
+              "i.\"jay_z\"::numeric As attr2, (SELECT ARRAY(SELECT " \
+              "j.\"cartodb_id\" FROM \"(SELECT * FROM a_list)\" As j WHERE j.\"andy\" IS " \
+              "NOT NULL AND j.\"jay_z\" IS NOT NULL AND j.\"jay_z\" <> 0 " \
+              "ORDER BY j.\"the_geom\" <-> i.\"the_geom\" ASC LIMIT 321 " \
+              "OFFSET 1 ) ) As neighbors FROM \"(SELECT * FROM a_list)\" As i WHERE " \
+              "i.\"andy\" IS NOT NULL AND i.\"jay_z\" IS NOT NULL AND " \
+              "i.\"jay_z\" <> 0 ORDER BY i.\"cartodb_id\" ASC;"
+
+        self.assertEqual(cc.get_query('knn', self.params), ans)
+
+    def test_get_attributes(self):
+        """Test get_attributes."""
+
+        ## need to add tests
+
+        self.assertEqual(True, True)
+
+    def test_get_weight(self):
+        """Test get_weight."""
+
+        self.assertEqual(True, True)
+
+
    def test_quad_position(self):
-        """Test lisa_sig_vals"""
+        """Test lisa_sig_vals."""

        quads = np.array([1, 2, 3, 4], np.int)

@@ -52,7 +125,7 @@ class MoranTest(unittest.TestCase):
        data = [ { 'id': d['id'], 'attr1': d['value'], 'neighbors': d['neighbors'] } for d in self.neighbors_data]
        plpy._define_result('select', data)
        random_seeds.set_random_seeds(1234)
-        result = cc.moran_local('subquery', 'value', 99, 'the_geom', 'cartodb_id', 'knn', 5)
+        result = cc.moran_local('table', 'value', 0.05, 5, 99, 'the_geom', 'cartodb_id', 'knn')
        result = [(row[0], row[1]) for row in result]
        expected = self.moran_data
        for ([res_val, res_quad], [exp_val, exp_quad]) in zip(result, expected):
@@ -64,20 +137,8 @@ class MoranTest(unittest.TestCase):
        data = [ { 'id': d['id'], 'attr1': d['value'], 'attr2': 1, 'neighbors': d['neighbors'] } for d in self.neighbors_data]
        plpy._define_result('select', data)
        random_seeds.set_random_seeds(1234)
-        result = cc.moran_local_rate('subquery', 'numerator', 'denominator', 99, 'the_geom', 'cartodb_id', 'knn', 5)
-        print 'result == None? ', result == None
+        result = cc.moran_local_rate('table', 'numerator', 'denominator', 0.05, 5, 99, 'the_geom', 'cartodb_id', 'knn')
        result = [(row[0], row[1]) for row in result]
        expected = self.moran_data
        for ([res_val, res_quad], [exp_val, exp_quad]) in zip(result, expected):
            self.assertAlmostEqual(res_val, exp_val)
-
-    def test_moran(self):
-        """Test Moran's I global"""
-        data = [{ 'id': d['id'], 'attr1': d['value'], 'neighbors': d['neighbors'] } for d in self.neighbors_data]
-        plpy._define_result('select', data)
-        random_seeds.set_random_seeds(1235)
-        result = cc.moran('table', 'value', 99, 'the_geom', 'cartodb_id', 'knn', 5)
-        print 'result == None?', result == None
-        result_moran = result[0][0]
-        expected_moran = np.array([row[0] for row in self.moran_data]).mean()
-        self.assertAlmostEqual(expected_moran, result_moran, delta=10e-2)
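
Review note: `test_get_weight` above is still a placeholder. As a starting point, the row-standardization that `get_weight` applies in the queen case can be checked in isolation. The helper below is a hypothetical pure-Python sketch (it does not build a `ps.W` object) and assumes query rows are dicts with `id` and `neighbors` keys, as in the test data.

```python
# Hypothetical sketch of the queen-case weighting in get_weight, without PySAL.
# Each row is assumed to be a dict with 'id' and 'neighbors' keys.
def row_standardized_weights(rows):
    """Give each neighbor weight 1/n so every non-empty row sums to 1."""
    weights = {}
    for row in rows:
        count = len(row['neighbors'])
        # Units with no neighbors get an empty weight list.
        weights[row['id']] = [1.0 / count] * count if count > 0 else []
    return weights


if __name__ == '__main__':
    sample = [{'id': 1, 'neighbors': [2, 3]}, {'id': 2, 'neighbors': []}]
    print(row_standardized_weights(sample))
```

A real test would presumably compare these lists against the `weights` attribute of the object returned by `cc.get_weight`; that comparison is left out here because it depends on PySAL being installed.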
--- a/src/py/crankshaft/test/test_pysal_utils.py
+++ b/src/py/crankshaft/test/test_pysal_utils.py
@@ -1,107 +0,0 @@
-import unittest
-
-import crankshaft.pysal_utils as pu
-from crankshaft import random_seeds
-
-
-class PysalUtilsTest(unittest.TestCase):
-    """Testing class for utility functions related to PySAL integrations"""
-
-    def setUp(self):
-        self.params = {"id_col": "cartodb_id",
-                       "attr1": "andy",
-                       "attr2": "jay_z",
-                       "subquery": "SELECT * FROM a_list",
-                       "geom_col": "the_geom",
-                       "num_ngbrs": 321}
-
-    def test_query_attr_select(self):
-        """Test query_attr_select"""
-
-        ans = "i.\"{attr1}\"::numeric As attr1, " \
-              "i.\"{attr2}\"::numeric As attr2, "
-
-        self.assertEqual(pu.query_attr_select(self.params), ans)
-
-    def test_query_attr_where(self):
-        """Test pu.query_attr_where"""
-
-        ans = "idx_replace.\"{attr1}\" IS NOT NULL AND " \
-              "idx_replace.\"{attr2}\" IS NOT NULL AND " \
-              "idx_replace.\"{attr2}\" <> 0"
-
-        self.assertEqual(pu.query_attr_where(self.params), ans)
-
-    def test_knn(self):
-        """Test knn neighbors constructor"""
-
-        ans = "SELECT i.\"cartodb_id\" As id, " \
-                     "i.\"andy\"::numeric As attr1, " \
-                     "i.\"jay_z\"::numeric As attr2, " \
-                     "(SELECT ARRAY(SELECT j.\"cartodb_id\" " \
-                                   "FROM (SELECT * FROM a_list) As j " \
-                                   "WHERE " \
-                                    "i.\"cartodb_id\" <> j.\"cartodb_id\" AND " \
-                                    "j.\"andy\" IS NOT NULL AND " \
-                                    "j.\"jay_z\" IS NOT NULL AND " \
-                                    "j.\"jay_z\" <> 0 " \
-                                   "ORDER BY " \
-                                    "j.\"the_geom\" <-> i.\"the_geom\" ASC " \
-                      "LIMIT 321)) As neighbors " \
-              "FROM (SELECT * FROM a_list) As i " \
-              "WHERE i.\"andy\" IS NOT NULL AND " \
-                    "i.\"jay_z\" IS NOT NULL AND " \
-                    "i.\"jay_z\" <> 0 " \
-              "ORDER BY i.\"cartodb_id\" ASC;"
-
-        self.assertEqual(pu.knn(self.params), ans)
-
-    def test_queen(self):
-        """Test queen neighbors constructor"""
-
-        ans = "SELECT i.\"cartodb_id\" As id, " \
-                     "i.\"andy\"::numeric As attr1, " \
-                     "i.\"jay_z\"::numeric As attr2, " \
-                     "(SELECT ARRAY(SELECT j.\"cartodb_id\" " \
-                                   "FROM (SELECT * FROM a_list) As j " \
-                                   "WHERE " \
-                                   "i.\"cartodb_id\" <> j.\"cartodb_id\" AND " \
-                                   "ST_Touches(i.\"the_geom\", " \
-                                              "j.\"the_geom\") AND " \
-                                   "j.\"andy\" IS NOT NULL AND " \
-                                   "j.\"jay_z\" IS NOT NULL AND " \
-                                   "j.\"jay_z\" <> 0)" \
-                                  ") As neighbors " \
-              "FROM (SELECT * FROM a_list) As i " \
-              "WHERE i.\"andy\" IS NOT NULL AND " \
-                    "i.\"jay_z\" IS NOT NULL AND " \
-                    "i.\"jay_z\" <> 0 " \
-              "ORDER BY i.\"cartodb_id\" ASC;"
-
-        self.assertEqual(pu.queen(self.params), ans)
-
-    def test_construct_neighbor_query(self):
-        """Test construct_neighbor_query"""
-
-        # Compare to raw knn query
-        self.assertEqual(pu.construct_neighbor_query('knn', self.params),
-                         pu.knn(self.params))
-
-    def test_get_attributes(self):
-        """Test get_attributes"""
-
-        ## need to add tests
-
-        self.assertEqual(True, True)
-
-    def test_get_weight(self):
-        """Test get_weight"""
-
-        self.assertEqual(True, True)
-
-    def test_empty_zipped_array(self):
-        """Test empty_zipped_array"""
-        ans2 = [(None, None)]
-        ans4 = [(None, None, None, None)]
-        self.assertEqual(pu.empty_zipped_array(2), ans2)
-        self.assertEqual(pu.empty_zipped_array(4), ans4)