Compare commits

13 commits: find_conto...similarity
* 97b4949f84
* df09d03de6
* b3c55614e3
* 1ddc338f3f
* d7424b02e5
* 45705f3a16
* 1995721921
* 4630d6b549
* 0fca6c3c1a
* fe22464b75
* cc4a35ebd9
* 00579cd838
* 3f20275d3d
@@ -45,8 +45,8 @@ source envs/dev/bin/activate

 Update extension in a working database with:

-* `ALTER EXTENSION crankshaft VERSION TO 'current';`
-  `ALTER EXTENSION crankshaft VERSION TO 'dev';`
+* `ALTER EXTENSION crankshaft UPDATE TO 'current';`
+  `ALTER EXTENSION crankshaft UPDATE TO 'dev';`

 Note: we keep the current development version install as 'dev' always;
 we update through the 'current' alias to allow changing the extension

@@ -58,7 +58,10 @@ should be dropped manually before the update.
 If the extension has not previously been installed in a database,
 it can be installed directly with:

-* `CREATE EXTENSION crankshaft WITH VERSION 'dev';`
+* `CREATE EXTENSION IF NOT EXISTS plpythonu;`
+  `CREATE EXTENSION IF NOT EXISTS postgis;`
+  `CREATE EXTENSION IF NOT EXISTS cartodb;`
+  `CREATE EXTENSION crankshaft WITH VERSION 'dev';`

 Note: the development extension uses the development python virtual
 environment automatically.
100 doc/02_moran.md
@@ -1,4 +1,102 @@
### Moran's I

## Name

CDB_AreasOfInterest -- returns a table with a cluster/outlier classification, the significance of a classification, an autocorrelation statistic (Local Moran's I), and the geometry id for each geometry in the original dataset.

## Synopsis

```sql
table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name)

table(numeric moran_val, text quadrant, numeric significance, int ids, numeric column_values) CDB_AreasOfInterest(text query, text column_name, int permutations, text geom_column, text id_column, text weight_type, int num_ngbrs)
```

## Description

CDB_AreasOfInterest is a table-returning function that classifies the geometries in a table by an attribute and gives a significance for that classification. This information can be used to find "Areas of Interest" by using the correlation of a geometry's attribute with that of its neighbors. Areas can be clusters, outliers, or neither (depending on which significance value is used).

Inputs:

* `query` (required): an arbitrary query against tables you have access to (e.g., in your account, shared in your organization, or through the Data Observatory). This string must contain the following columns: an id `INT` (e.g., `cartodb_id`), a geometry (e.g., `the_geom`), and the numeric attribute specified in `column_name`.
* `column_name` (required): the column to perform the areas-of-interest analysis on. The data must be numeric (e.g., `float`, `int`, etc.).
* `permutations` (optional): the number of permutations used to calculate the significance of a classification. Defaults to 99, which is sufficient in most situations.
* `geom_column` (optional): the name of the geometry column. Data must be of type `geometry`.
* `id_column` (optional): the name of the id column (e.g., `cartodb_id`). Data must be of type `int` or `bigint` and have a unique constraint.
* `weight_type` (optional): the type of weight used to define a neighborhood. Options are `knn` or `queen`.
* `num_ngbrs` (optional): the number of neighbors in the neighborhood around a geometry. Only used if `knn` is chosen above.

Outputs:

* `moran_val`: the underlying correlation statistic used in the analysis
* `quadrant`: a human-readable interpretation of the classification
* `significance`: the significance of the classification (closer to 0 is more significant)
* `ids`: the id of the original geometry (used for joining against the original table if desired -- see Examples)
* `column_values`: the original column values from `column_name`

Availability: crankshaft v0.0.1 and above
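The Local Moran's I statistic behind this classification can be sketched in a few lines of numpy. This is a hedged illustration only, not the crankshaft implementation (which relies on PySAL): each geometry's standardized value is multiplied by the mean standardized value of its k nearest neighbors, so a positive product suggests a cluster (high-high or low-low) and a negative product suggests an outlier.

```python
import numpy as np

def local_moran(values, neighbors):
    # values: 1-D attribute array; neighbors: list of neighbor-index
    # lists, e.g. as produced by a knn weight
    z = (values - values.mean()) / values.std()
    lag = np.array([z[idx].mean() for idx in neighbors])  # row-standardized spatial lag
    return z * lag

# Two low values neighboring each other and two high values neighboring
# each other: all four products come out positive, i.e. cluster-like.
vals = np.array([1.0, 2.0, 8.0, 9.0])
nbrs = [[1], [0], [3], [2]]
print(local_moran(vals, nbrs))
```

The significance reported by the function comes from comparing each observed product against products computed under random permutations of the values.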
## Examples

```sql
SELECT
    t.the_geom_webmercator,
    t.cartodb_id,
    aoi.significance,
    aoi.quadrant As aoi_quadrant
  FROM
    observatory.acs2013 As t
  JOIN
    crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013',
                                   'gini_index') As aoi
  ON t.cartodb_id = aoi.ids
```
## API Usage

Example:

```text
http://eschbacher.cartodb.com/api/v2/sql?q=SELECT * FROM crankshaft.CDB_AreasOfInterest('SELECT * FROM observatory.acs2013','gini_index')
```
Result:

```json
{
  "time": 0.120,
  "total_rows": 100,
  "rows": [{
    "moran_val": 0.7213,
    "quadrant": "High area",
    "significance": 0.03,
    "ids": 1,
    "column_values": 0.22
  },
  {
    "moran_val": -0.7213,
    "quadrant": "Low outlier",
    "significance": 0.13,
    "ids": 2,
    "column_values": 0.03
  },
  ...
  ]
}
```
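A client consuming this response might, for example, keep only the significantly classified rows. The sketch below uses an abbreviated sample payload mirroring the shape above, with illustrative values rather than real query output:

```python
import json

# Abbreviated sample payload in the shape of the SQL API response above
payload = '''{"time": 0.120, "total_rows": 2, "rows": [
  {"moran_val": 0.7213, "quadrant": "High area", "significance": 0.03,
   "ids": 1, "column_values": 0.22},
  {"moran_val": -0.7213, "quadrant": "Low outlier", "significance": 0.13,
   "ids": 2, "column_values": 0.03}]}'''

resp = json.loads(payload)
# Keep only classifications significant at the 0.05 level
significant = [r['ids'] for r in resp['rows'] if r['significance'] < 0.05]
print(significant)  # [1]
```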
## See Also

crankshaft's areas of interest functions:

* [CDB_AreasOfInterest_Global]()
* [CDB_AreasOfInterest_Rate_Local]()
* [CDB_AreasOfInterest_Rate_Global]()

PostGIS clustering functions:

* [ST_ClusterIntersecting](http://postgis.net/docs/manual-2.2/ST_ClusterIntersecting.html)
* [ST_ClusterWithin](http://postgis.net/docs/manual-2.2/ST_ClusterWithin.html)

-- removing below, working into above

#### What is Moran's I and why is it significant for CartoDB?
24 doc/docs_template.md Normal file
@@ -0,0 +1,24 @@
## Name

## Synopsis

## Description

Availability: v...

## Examples

```sql
-- example of the function in use
SELECT cdb_awesome_function(the_geom, 'total_pop')
FROM table_name
```

## API Usage

_asdf_

## See Also

_Other function pages_
15 src/pg/sql/80_similarity_rank.sql Normal file
@@ -0,0 +1,15 @@
CREATE OR REPLACE FUNCTION cdb_SimilarityRank(cartodb_id numeric, query text)
returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
as $$
  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.similarity import similarity_rank
  return similarity_rank(cartodb_id, query)
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION cdb_MostSimilar(cartodb_id numeric, query text, matches numeric)
returns TABLE (cartodb_id NUMERIC, similarity NUMERIC)
as $$
  plpy.execute('SELECT cdb_crankshaft._cdb_crankshaft_activate_py()')
  from crankshaft.similarity import most_similar
  return most_similar(matches, query)
$$ LANGUAGE plpythonu;
@@ -1,2 +1,3 @@
 import random_seeds
 import clustering
+import similarity

1 src/py/crankshaft/crankshaft/similarity/__init__.py Normal file
@@ -0,0 +1 @@
from similarity import *
91 src/py/crankshaft/crankshaft/similarity/similarity.py Normal file
@@ -0,0 +1,91 @@
from sklearn.neighbors import NearestNeighbors
import scipy.stats as stats
import numpy as np
import plpy
import time
import cPickle

def query_to_dictionary(result):
    return [dict(zip(r.keys(), r.values())) for r in result]


def drop_all_nan_columns(data):
    return data[:, ~np.isnan(data).all(axis=0)]


def fill_missing_na(data, val=None):
    inds = np.where(np.isnan(data))
    if val is None:
        col_mean = stats.nanmean(data, axis=0)
        data[inds] = np.take(col_mean, inds[1])
    else:
        data[inds] = np.take(val, inds[1])
    return data
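A quick standalone check of what the two cleaning helpers do, in pure numpy (`np.nanmean` stands in here for `scipy.stats.nanmean`, which computes the same column means):

```python
import numpy as np

data = np.array([[1.0, np.nan, np.nan],
                 [3.0, 4.0,    np.nan]])

# drop_all_nan_columns: the third column is entirely NaN, so it is dropped
kept = data[:, ~np.isnan(data).all(axis=0)]

# fill_missing_na with val=None: remaining NaNs become their column's mean
inds = np.where(np.isnan(kept))
col_mean = np.nanmean(kept, axis=0)
kept[inds] = np.take(col_mean, inds[1])
print(kept)  # [[1. 4.]
             #  [3. 4.]]
```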

def similarity_rank(target_cartodb_id, query):
    start_time = time.time()
    # data = query_to_dictionary(plpy.execute(query))
    plpy.notice('converted, running query ', time.time() - start_time)

    data = plpy.execute(query_only_values(query))
    plpy.notice('ran query, getting cartodb_ids ', time.time() - start_time)
    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
    target_id = cartodb_ids.index(target_cartodb_id)
    plpy.notice('ran query, extracting ', time.time() - start_time)
    features, target = extract_features_target(data, target_id)
    plpy.notice('extracted, cleaning ', time.time() - start_time)
    features = fill_missing_na(drop_all_nan_columns(features))
    plpy.notice('cleaned, normalizing ', time.time() - start_time)

    normed_features, normed_target = normalize_features(features, target)
    plpy.notice('normalized, training ', time.time() - start_time)
    tree = train(normed_features)
    # plpy.notice('tree_dump ', len(cPickle.dumps(tree, protocol=cPickle.HIGHEST_PROTOCOL)))
    plpy.notice('trained, querying ', time.time() - start_time)
    dist, ind = tree.kneighbors(normed_target)
    plpy.notice('queried, returning ', time.time() - start_time)
    # pair each neighbor's cartodb_id with its distance; kneighbors returns
    # neighbors sorted by distance, not in table order
    return zip([cartodb_ids[i] for i in ind[0]], dist[0])


def query_cartodb_id(query):
    return 'select array_agg(cartodb_id) a from ({0}) b'.format(query)


def query_only_values(query):
    first_row = plpy.execute('select * from ({query}) a limit 1'.format(query=query))
    just_values = ','.join([key for key in first_row[0].keys()
                            if key not in ['the_geom', 'the_geom_webmercator', 'cartodb_id']])
    return 'select Array[{0}] a from ({1}) b '.format(just_values, query)

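The rewriting that `query_only_values` performs can be seen without a database connection; the `listings` table and its column names below are made up purely for illustration:

```python
# Hypothetical column list, as the first-row probe would report it
# for "SELECT * FROM listings"
columns = ['cartodb_id', 'the_geom', 'the_geom_webmercator', 'price', 'rooms']
query = 'SELECT * FROM listings'

# Same filtering and string building as query_only_values above
just_values = ','.join(k for k in columns
                       if k not in ['the_geom', 'the_geom_webmercator', 'cartodb_id'])
rewritten = 'select Array[{0}] a from ({1}) b'.format(just_values, query)
print(rewritten)  # select Array[price,rooms] a from (SELECT * FROM listings) b
```

Each row then comes back as a single numeric array column `a`, which is what `extract_features_target` expects.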

def most_similar(matches, query):
    data = plpy.execute(query_only_values(query))
    cartodb_ids = plpy.execute(query_cartodb_id(query))[0]['a']
    features, _ = extract_features_target(data)
    tree = train(features)
    results = []
    for target in features:
        # ids of the k nearest rows to this row, nearest first
        dist, ind = tree.kneighbors(target, n_neighbors=int(matches))
        results.append([cartodb_ids[i] for i in ind[0]])
    return zip(cartodb_ids, results)

def train(features):
    tree = NearestNeighbors(n_neighbors=len(features), algorithm='auto').fit(features)
    return tree

def normalize_features(features, target):
    maxes = features.max(axis=0)
    mins = features.min(axis=0)
    return (features - mins) / (maxes - mins), (target - mins) / (maxes - mins)
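The min-max scaling in `normalize_features` can be checked on a small array; note that a constant column (max == min) would divide by zero, so real inputs may need a guard:

```python
import numpy as np

features = np.array([[0.0, 10.0],
                     [5.0, 20.0],
                     [10.0, 30.0]])
target = np.array([5.0, 25.0])

mins, maxes = features.min(axis=0), features.max(axis=0)
normed = (features - mins) / (maxes - mins)        # each column mapped to [0, 1]
normed_target = (target - mins) / (maxes - mins)   # target scaled with the same bounds
print(normed_target)  # halfway along x, three-quarters along y
```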

def extract_row(row):
    keys = row.keys()
    values = row.values()
    del values[keys.index('cartodb_id')]
    return values


def extract_features_target(data, target_index=None):
    features = np.array([row['a'] for row in data], dtype=float)
    target = features[target_index] if target_index is not None else None
    return features, target

@@ -40,9 +40,9 @@ setup(

        # The choice of component versions is dictated by what's
        # provisioned in the production servers.
-       install_requires=['pysal==1.9.1'],
+       install_requires=['pysal==1.9.1', 'scikit-learn==0.17.1'],

-       requires=['pysal', 'numpy'],
+       requires=['pysal', 'numpy', 'sklearn'],

        test_suite='test'
)