Compare commits

...

13 Commits

Author SHA1 Message Date
Ubuntu
97b4949f84 performance imporvments 2016-05-27 19:31:37 +00:00
Ubuntu
df09d03de6 adding sklearn to deps 2016-05-27 14:59:24 +00:00
Ubuntu
b3c55614e3 fixing syntax 2016-05-27 14:58:43 +00:00
Ubuntu
1ddc338f3f adding missing ; 2016-05-27 14:58:05 +00:00
Stuart Lynn
d7424b02e5 adding import to crankshaft __init__ 2016-05-27 10:33:00 -04:00
Stuart Lynn
45705f3a16 adding function preflight 2016-05-27 10:29:47 -04:00
Stuart Lynn
1995721921 adding functions to drop columns which are all nan and fill nan values with the mean of those columns 2016-05-27 10:29:15 -04:00
Ubuntu
4630d6b549 debugging 2016-05-26 19:32:49 +00:00
Stuart Lynn
0fca6c3c1a inital commit of similarity functions 2016-05-26 12:31:58 -04:00
Andy Eschbacher
fe22464b75 Merge pull request #22 from CartoDB/update-docs
Update docs format
2016-05-23 09:51:44 -04:00
Javier Goizueta
cc4a35ebd9 Fix instructions to update/install the extension 2016-05-20 11:47:12 +02:00
Andy Eschbacher
00579cd838 adding template 2016-03-23 17:10:08 -04:00
Andy Eschbacher
3f20275d3d adopting new format (wip) 2016-03-23 17:09:52 -04:00
8 changed files with 239 additions and 6 deletions


@@ -45,8 +45,8 @@ source envs/dev/bin/activate
 Update extension in a working database with:
-* `ALTER EXTENSION crankshaft VERSION TO 'current';`
-  `ALTER EXTENSION crankshaft VERSION TO 'dev';`
+* `ALTER EXTENSION crankshaft UPDATE TO 'current';`
+  `ALTER EXTENSION crankshaft UPDATE TO 'dev';`
 Note: we keep the current development version install as 'dev' always;
 we update through the 'current' alias to allow changing the extension
@@ -58,7 +58,10 @@ should be dropped manually before the update.
 If the extension has not previously been installed in a database,
 it can be installed directly with:
-* `CREATE EXTENSION crankshaft WITH VERSION 'dev';`
+* `CREATE EXTENSION IF NOT EXISTS plpythonu;`
+  `CREATE EXTENSION IF NOT EXISTS postgis;`
+  `CREATE EXTENSION IF NOT EXISTS cartodb;`
+  `CREATE EXTENSION crankshaft WITH VERSION 'dev';`
 Note: the development extension uses the development python virtual
 environment automatically.


@@ -1,4 +1,102 @@
### Moran's I
## Name
CDB_AreasOfInterest -- returns a table with a cluster/outlier classification, the significance of a classification, an autocorrelation statistic (Local Moran's I), and the geometry id for each geometry in the original dataset.
## Synopsis
```sql
table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name)
table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name, int permutations, text geom_column, text id_column, text weight_type, int num_ngbrs)
```
## Description
CDB_AreasOfInterest is a table-returning function that classifies the geometries in a table by an attribute and gives a significance for that classification. This information can be used to find "Areas of Interest" by using the correlation of a geometry's attribute with that of its neighbors. Areas can be clusters, outliers, or neither (depending on which significance value is used).
Inputs:
* `query` (required): an arbitrary query against tables you have access to (e.g., in your account, shared in your organization, or through the Data Observatory). This string must contain the following columns: an id `INT` (e.g., `cartodb_id`), geometry (e.g., `the_geom`), and the numeric attribute which is specified in `column_name`
* `column_name` (required): the column to run the Areas of Interest analysis on. The data must be numeric (e.g., `float`, `int`, etc.)
* `permutations` (optional): used to calculate the significance of a classification. Defaults to 99, which is sufficient in most situations.
* `geom_column` (optional): the name of the geometry column. Data must be of type `geometry`.
* `id_column` (optional): the name of the id column (e.g., `cartodb_id`). Data must be of type `int` or `bigint` and have a unique condition on the data.
* `weight_type` (optional): the type of weight used for determining what defines a neighborhood. Options are `knn` or `queen`.
* `num_ngbrs` (optional): the number of neighbors in a neighborhood around a geometry. Only used if `knn` is chosen above.
Outputs:
* `moran_val`: underlying correlation statistic used in analysis
* `quadrant`: human-readable interpretation of classification
* `significance`: significance of classification (closer to 0 is more significant)
* `ids`: id of original geometry (used for joining against original table if desired -- see examples)
* `column_values`: original column values from `column_name`
Availability: crankshaft v0.0.1 and above
## Examples
```sql
SELECT
  t.the_geom_webmercator,
  t.cartodb_id,
  aoi.significance,
  aoi.quadrant As aoi_quadrant
FROM
  observatory.acs2013 As t
JOIN
  crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013',
                                 'gini_index') As aoi
ON
  t.cartodb_id = aoi.ids
```
## API Usage
Example
```text
http://eschbacher.cartodb.com/api/v2/sql?q=SELECT * FROM crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013','gini_index')
```
Result
```json
{
  "time": 0.120,
  "total_rows": 100,
  "rows": [{
      "moran_val": 0.7213,
      "quadrant": "High area",
      "significance": 0.03,
      "ids": 1,
      "column_values": 0.22
    },
    {
      "moran_val": -0.7213,
      "quadrant": "Low outlier",
      "significance": 0.13,
      "ids": 2,
      "column_values": 0.03
    },
    ...
  ]
}
```
## See Also
crankshaft's areas of interest functions:
* [CDB_AreasOfInterest_Global]()
* [CDB_AreasOfInterest_Rate_Local]()
* [CDB_AreasOfInterest_Rate_Global]()
PostGIS clustering functions:
* [ST_ClusterIntersecting](http://postgis.net/docs/manual-2.2/ST_ClusterIntersecting.html)
* [ST_ClusterWithin](http://postgis.net/docs/manual-2.2/ST_ClusterWithin.html)
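The `permutations` input above drives a conditional permutation test. A minimal pure-Python sketch of the idea (deliberately simplified, not the crankshaft/PySAL implementation; function names here are hypothetical):

```python
import random

def local_moran(z, neighbor_zs):
    # simplified Local Moran's I for one observation: its standardized
    # value times the average standardized value of its neighbors
    return z * sum(neighbor_zs) / len(neighbor_zs)

def pseudo_p_value(z, neighbor_zs, other_zs, permutations=99, seed=0):
    # conditional permutation: redraw the neighborhood from the remaining
    # observations and count draws at least as extreme as the observed value
    rng = random.Random(seed)
    observed = local_moran(z, neighbor_zs)
    k = len(neighbor_zs)
    as_extreme = sum(
        1 for _ in range(permutations)
        if abs(local_moran(z, rng.sample(other_zs, k))) >= abs(observed)
    )
    return (as_extreme + 1.0) / (permutations + 1.0)
```

Values near 0 mark significant clusters/outliers, which is how the `significance` output above is read.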
-- removing below, working into above
#### What is Moran's I and why is it significant for CartoDB?

doc/docs_template.md Normal file

@@ -0,0 +1,24 @@
## Name
## Synopsis
## Description
Availability: v...
## Examples
```SQL
-- example of the function in use
SELECT cdb_awesome_function(the_geom, 'total_pop')
FROM table_name
```
## API Usage
_asdf_
## See Also
_Other function pages_


@@ -0,0 +1,15 @@
CREATE OR REPLACE FUNCTION cdb_SimilarityRank(cartodb_id numeric, query text)
RETURNS TABLE (cartodb_id NUMERIC, similarity NUMERIC)
AS $$
    plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
    from crankshaft.similarity import similarity_rank
    return similarity_rank(cartodb_id, query)
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION cdb_MostSimilar(cartodb_id numeric, query text, matches numeric)
RETURNS TABLE (cartodb_id NUMERIC, similarity NUMERIC)
AS $$
    plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
    from crankshaft.similarity import most_similar
    return most_similar(matches, query)
$$ LANGUAGE plpythonu;


@@ -1,2 +1,3 @@
import random_seeds
import clustering
import similarity


@@ -0,0 +1 @@
from similarity import *


@@ -0,0 +1,91 @@
from sklearn.neighbors import NearestNeighbors
import scipy.stats as stats
import numpy as np
import plpy
import time
import cPickle
def query_to_dictionary(result):
    return [dict(zip(r.keys(), r.values())) for r in result]

def drop_all_nan_columns(data):
    # keep only columns that contain at least one non-NaN value
    return data[:, ~np.isnan(data).all(axis=0)]

def fill_missing_na(data, val=None):
    # replace NaNs with the column mean (or a supplied fill value)
    inds = np.where(np.isnan(data))
    if val is None:
        col_mean = stats.nanmean(data, axis=0)
        data[inds] = np.take(col_mean, inds[1])
    else:
        data[inds] = np.take(val, inds[1])
    return data
def similarity_rank(target_cartodb_id, query):
    start_time = time.time()
    #plpy.notice('converting to dictionary ', start_time)
    #data = query_to_dictionary(plpy.execute(query))
    plpy.notice('converted, running query ', time.time() - start_time)
    data = plpy.execute(query_only_values(query))
    plpy.notice('ran query, getting cartodb_ids ', time.time() - start_time)
    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
    target_id = cartodb_ids.index(target_cartodb_id)
    plpy.notice('got ids, extracting ', time.time() - start_time)
    features, _ = extract_features_target(data, target_id)
    plpy.notice('extracted, cleaning ', time.time() - start_time)
    features = fill_missing_na(drop_all_nan_columns(features))
    # re-pick the target row after dropping columns so shapes match
    target = features[target_id]
    plpy.notice('cleaned, normalizing ', time.time() - start_time)
    normed_features, normed_target = normalize_features(features, target)
    plpy.notice('normalized, training ', time.time() - start_time)
    tree = train(normed_features)
    #plpy.notice('tree_dump ', len(cPickle.dumps(tree, protocol=cPickle.HIGHEST_PROTOCOL)))
    plpy.notice('trained, querying ', time.time() - start_time)
    dist, ind = tree.kneighbors(normed_target)
    plpy.notice('queried, pairing ids with distances ', time.time() - start_time)
    # kneighbors returns neighbors sorted by distance; pair each
    # neighbor's cartodb_id with its distance to the target
    return [(cartodb_ids[i], d) for i, d in zip(ind[0], dist[0])]
def query_cartodb_id(query):
return 'select array_agg(cartodb_id) a from ({0}) b'.format(query)
def query_only_values(query):
first_row = plpy.execute('select * from ({query}) a limit 1'.format(query=query))
just_values = ','.join([ key for key in first_row[0].keys() if key not in ['the_geom', 'the_geom_webmercator','cartodb_id']])
return 'select Array[{0}] a from ({1}) b '.format(just_values, query)
def most_similar(matches, query):
    data = plpy.execute(query_only_values(query))
    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
    features = np.array([row['a'] for row in data], dtype=float)
    features = fill_missing_na(drop_all_nan_columns(features))
    tree = train(features)
    results = []
    for target in features:
        # cartodb_ids of the `matches` nearest rows to this row
        dist, ind = tree.kneighbors(target, n_neighbors=int(matches))
        results.append([cartodb_ids[i] for i in ind[0]])
    return zip(cartodb_ids, results)
def train(features):
tree = NearestNeighbors( n_neighbors=len(features), algorithm='auto').fit(features)
return tree
def normalize_features(features, target):
maxes = features.max(axis=0)
mins = features.min(axis=0)
return (features - mins)/(maxes-mins), (target-mins)/(maxes-mins)
def extract_row(row):
    keys = row.keys()
    values = row.values()
    del values[keys.index('cartodb_id')]
    return values

def extract_features_target(data, target_index=None):
    features = [row['a'] for row in data]
    if target_index is None:
        return np.array(features, dtype=float), None
    target = features[target_index]
    return np.array(features, dtype=float), np.array(target, dtype=float)
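The clean/normalize/rank pipeline above can be illustrated with a self-contained pure-Python toy stand-in for the numpy/scikit-learn versions (function names here are hypothetical, and constant columns are assumed away):

```python
import math

def fill_missing_with_column_mean(rows):
    # like fill_missing_na: replace NaNs with the mean of the
    # non-NaN entries in that column
    cols = list(zip(*rows))
    means = []
    for c in cols:
        vals = [v for v in c if not math.isnan(v)]
        means.append(sum(vals) / len(vals))
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

def minmax_normalize(rows):
    # like normalize_features: scale each column to [0, 1]
    # (assumes every column actually varies, i.e. hi > lo)
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - lo[j]) / (hi[j] - lo[j]) for j, v in enumerate(r)]
            for r in rows]

def rank_by_distance(rows, target_index):
    # brute-force stand-in for the NearestNeighbors query: row indices
    # sorted by Euclidean distance to the target row
    t = rows[target_index]
    def dist(r):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, t)))
    return sorted(range(len(rows)), key=lambda i: dist(rows[i]))
```

The real code gains over this sketch by vectorizing with numpy and using a ball tree/kd-tree instead of brute-force distances.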


@@ -40,9 +40,9 @@ setup(
         # The choice of component versions is dictated by what's
         # provisioned in the production servers.
-        install_requires=['pysal==1.9.1'],
+        install_requires=['pysal==1.9.1', 'scikit-learn==0.17.1'],
-        requires=['pysal', 'numpy' ],
+        requires=['pysal', 'numpy', 'sklearn'],
         test_suite='test'
 )