release artifact

Merge branch 'develop' into release-v-1.4.0
update NEWS.md
2017-03-22 15:17:19 +00:00 · 2017-03-22 15:16:35 +00:00 · 2017-03-22 15:14:35 +00:00 · 2017-03-22 15:12:50 +00:00 · 2017-03-21 21:24:50 +00:00 · 2017-03-21 17:26:02 +00:00
14 changed files with 5800 additions and 950 deletions
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,27 @@
+1.4.0 (2017-03-21)
+
+__API Changes__
+
+* Allow for override of `target_area` and `target_geoms` in `OBS_GetMeta`
+  ([#276](https://github.com/CartoDB/observatory-extension/pull/265)).  This
+  allows the interface to work with points and sparse areas much better.
+* Allow for override of `max_timespan_rank` and `max_score_rank` on an
+  item-by-item basis for metadata.
+* `numer_description`, `geom_description`, `denom_description`,
+  `numer_t_description`, `denom_t_description` and `geom_t_description` now
+  returned as part of `OBS_GetMeta`.
+
+__Improvements__
+
+* Reduced amount of simplification done on input geometries (from 0.0001 above
+  500 points to 0.00001 above 1000 points).
+* Added tests to confirm that accurate results are returned from automatic
+  boundary selection.
+
+1.3.5 (2017-03-15)
+
+No changes.  Artifact to allow for data update.
+
 1.3.4 (2017-03-10)

 __Bugfixes__
--- a/doc/boundary_functions.md
+++ b/doc/boundary_functions.md
@@ -4,7 +4,7 @@ Use the following functions to retrieve [Boundary](https://carto.com/docs/carto-

 You can [access](https://carto.com/docs/carto-engine/data/accessing) boundaries through CARTO Builder. The same methods will work if you are using the CARTO Engine to develop your application. We [encourage you](http://docs/carto-engine/data/accessing/#best-practices) to use table modifying methods (UPDATE and INSERT) over dynamic methods (SELECT).

-## OBS_GetBoundariesByGeometry(polygon geometry, geometry_id text)
+## OBS_GetBoundariesByGeometry(geom geometry, geometry_id text)

 The ```OBS_GetBoundariesByGeometry(geometry, geometry_id)``` method returns a set of boundary geometries that intersect a supplied geometry. This can be used to find all boundaries that are within or overlap a bounding box. You have the ability to choose whether to retrieve all boundaries that intersect your supplied bounding box or only those that fall entirely inside of your bounding box.

@@ -12,7 +12,7 @@ The ```OBS_GetBoundariesByGeometry(geometry, geometry_id)``` method returns a se

 Name |Description
 --- | ---
-polygon | a bounding box or other WGS84 geometry
+geom | a WGS84 geometry
 geometry_id | a string identifier for a boundary geometry
 timespan (optional) | year(s) to request from ('NULL' (default) gives most recent)
 overlap_type (optional) | one of '[intersects](http://postgis.net/docs/manual-2.2/ST_Intersects.html)' (default), '[contains](http://postgis.net/docs/manual-2.2/ST_Contains.html)', or '[within](http://postgis.net/docs/manual-2.2/ST_Within.html)'.
@@ -26,7 +26,7 @@ Column Name | Description
 the_geom | a boundary geometry (e.g., US Census tract boundaries)
 geom_refs | a string identifier for the geometry (e.g., geoids of US Census tracts)

-If geometries are not found for the requested `polygon`, `geometry_id`, `timespan`, or `overlap_type`, then null values are returned.
+If geometries are not found for the requested `geom`, `geometry_id`, `timespan`, or `overlap_type`, then null values are returned.

 #### Example

@@ -44,7 +44,6 @@ FROM OBS_GetBoundariesByGeometry(

 #### Errors

-* If a geometry other than a point is passed as the first argument, an error is thrown: `Invalid geometry type (ST_Polygon), expecting 'ST_Point'`
 * If an `overlap_type` other than the valid ones listed above is entered, then an error is thrown

 ## OBS_GetPointsByGeometry(polygon geometry, geometry_id text)
--- a/doc/measures_functions.md
+++ b/doc/measures_functions.md
@@ -196,7 +196,7 @@ UPDATE tablename
 SET segmentation = OBS_GetCategory(the_geom, 'us.census.spielman_singleton_segments.X55')
 ```

-## OBS_GetMeta(extent geometry, metadata json, max_timespan_rank, max_boundary_score_rank, num_target_geoms)
+## OBS_GetMeta(extent geometry, metadata json, max_timespan_rank, max_score_rank, target_geoms)

 The ```OBS_GetMeta(extent, metadata)``` function returns a completed Data
 Observatory metadata JSON Object for use in ```OBS_GetData(geomvals,
@@ -215,7 +215,7 @@ extent | A geometry of the extent of the input geometries
 metadata | A JSON array composed of metadata input objects.  Each indicates one desired measure for an output column, and optionally additional parameters about that column
 max_timespan_rank | How many historical time periods to include.  Defaults to 1
 max_boundary_score_rank | How many alternative boundary levels to include.  Defaults to 1
-num_target_geoms | Target number of geometries.  Boundaries with close to this many objects within `extent` will be ranked highest. 
+target_geoms | Target number of geometries.  Boundaries with close to this many objects within `extent` will be ranked highest. 

 The schema of the metadata input objects are as follows:

@@ -227,6 +227,10 @@ normalization | The desired normalization.  One of 'area', 'prenormalized', or '
 denom_id | Identifier for a desired normalization column in case `normalization` is 'denominated'.  Will be automatically assigned if necessary.  Ignored if this metadata object specifies a geometry.
 numer_timespan | The desired timespan for the measurement.  Defaults to most recent timespan available if left unspecified.
 geom_timespan | The desired timespan for the geometry.  Defaults to timespan matching numer_timespan if left unspecified.
+target_area | Area to score boundaries against, instead of the area of the geometry passed as `extent`: boundaries are ranked by how close they come to `target_geoms` geometries within this area.  Unit is square degrees WGS84.  Set this to `0` to use the smallest source geometry for this element of metadata, for example when passing in points.
+target_geoms | Override global `target_geoms` for this element of metadata
+max_timespan_rank | Override global `max_timespan_rank` for this element of metadata
+max_score_rank | Override global `max_score_rank` for this element of metadata

 #### Returns

@@ -245,6 +249,8 @@ Metadata Output Key | Description
 numer_id | Identifier for desired measurement
 numer_timespan | Timespan that will be used of the desired measurement
 numer_name | Human-readable name of desired measure
+numer_description | Long human-readable description of the desired measure
+numer_t_description | Further information about the source table
 numer_type | PostgreSQL/PostGIS type of desired measure
 numer_colname | Internal identifier for column name
 numer_tablename | Internal identifier for table
@@ -252,6 +258,8 @@ numer_geomref_colname | Internal identifier for geomref column name
 denom_id | Identifier for desired normalization
 denom_timespan | Timespan that will be used of the desired normalization
 denom_name | Human-readable name of desired measure's normalization
+denom_description | Long human-readable description of the desired measure's normalization
+denom_t_description | Further information about the source table
 denom_type | PostgreSQL/PostGIS type of desired measure's normalization
 denom_colname | Internal identifier for normalization column name
 denom_tablename | Internal identifier for normalization table
@@ -259,12 +267,14 @@ denom_geomref_colname | Internal identifier for normalization geomref column nam
 geom_id | Identifier for desired boundary geometry
 geom_timespan | Timespan that will be used of the desired boundary geometry
 geom_name | Human-readable name of desired boundary geometry
+geom_description | Long human-readable description of the desired boundary geometry
+geom_t_description | Further information about the source table
 geom_type | PostgreSQL/PostGIS type of desired boundary geometry
 geom_colname | Internal identifier for boundary geometry column name
 geom_tablename | Internal identifier for boundary geometry table
 geom_geomref_colname | Internal identifier for boundary geometry ref column name
 timespan_rank | Ranking of this measurement by time, most recent is 1, second most recent 2, etc.
-score | The score of this measurement's boundary compared to the `extent` and `num_target_geoms` passed in.  Between 0 and 100.
+score | The score of this measurement's boundary compared to the `extent` and `target_geoms` passed in.  Between 0 and 100.
 score_rank | The ranking of this measurement's boundary, highest ranked is 1, second is 2, etc.
 numer_aggregate | The aggregate type of the numerator, either `sum`, `average`, `median`, or blank
 denom_aggregate | The aggregate type of the denominator, either `sum`, `average`, `median`, or blank
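The per-item override keys documented above (`target_area`, `target_geoms`, `max_timespan_rank`, `max_score_rank`) travel inside each metadata input object. The following is a minimal client-side sketch of assembling that JSON array. The helper name `build_meta_item` is hypothetical and not part of the extension; only the key names and the measure identifier come from this diff.

```python
import json


def build_meta_item(numer_id, target_area=None, target_geoms=None,
                    max_timespan_rank=None, max_score_rank=None):
    """Build one metadata input object (hypothetical helper).

    Overrides left as None are omitted, so the global arguments passed
    to OBS_GetMeta would apply to this item.
    """
    item = {"numer_id": numer_id}
    overrides = {
        "target_area": target_area,
        "target_geoms": target_geoms,
        "max_timespan_rank": max_timespan_rank,
        "max_score_rank": max_score_rank,
    }
    item.update({k: v for k, v in overrides.items() if v is not None})
    return item


# target_area=0 is the documented setting for point inputs.
params = [build_meta_item("us.census.acs.B01003001", target_area=0)]
payload = json.dumps(params)
```

The resulting `payload` string is what would be passed as the `metadata` argument; how it is sent to the database is outside the scope of this sketch.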
--- a/release/observatory--1.3.5.sql
+++ b/release/observatory--1.3.5.sql
--- a/release/observatory--1.4.0.sql
+++ b/release/observatory--1.4.0.sql
--- a/release/observatory.control
+++ b/release/observatory.control
@@ -1,5 +1,5 @@
 comment = 'CartoDB Observatory backend extension'
-default_version = '1.3.4'
+default_version = '1.4.0'
 requires = 'postgis'
 superuser = true
 schema = cdb_observatory
--- a/src/pg/observatory.control
+++ b/src/pg/observatory.control
@@ -1,5 +1,5 @@
 comment = 'CartoDB Observatory backend extension'
-default_version = '1.3.4'
+default_version = '1.4.0'
 requires = 'postgis'
 superuser = true
 schema = cdb_observatory
--- a/src/pg/sql/41_observatory_augmentation.sql
+++ b/src/pg/sql/41_observatory_augmentation.sql
@@ -126,9 +126,23 @@ BEGIN
  geom_filters := (SELECT Array_Agg(val) FILTER (WHERE val IS NOT NULL) FROM (SELECT (JSON_Array_Elements(params))->>'geom_id' val) bar);
  meta_filter_clause := '(m.numer_id = ANY ($6) OR m.geom_id = ANY ($7))';

-  scores_clause := 'SELECT *
-                    FROM cdb_observatory._OBS_GetGeometryScores($1,
-                    (SELECT Array_Agg(geom_id) FROM meta), $2) scores ';
+  scores_clause := ' agg_geoms AS (
+    SELECT target_geoms, target_area, ARRAY_AGG(geom_id) geom_ids
+    FROM meta
+    GROUP BY target_geoms, target_area
+  ), scores AS (
+    SELECT target_geoms, target_area,
+      CASE target_area
+      -- point-specific, just order by numgeoms instead of score
+      WHEN 0 THEN scores.numgeoms
+      -- has some area, use proper scoring
+      ELSE scores.score
+      END AS score,
+           scores.numgeoms, scores.table_id, scores.column_id
+    FROM agg_geoms,
+         LATERAL cdb_observatory._OBS_GetGeometryScores($1,
+            geom_ids, COALESCE(target_geoms, $2), target_area) scores
+  ) ';

  IF JSON_Array_Length(params) = 1 THEN
    IF numer_filters IS NULL AND geom_filters IS NOT NULL THEN
@@ -142,9 +156,11 @@ BEGIN
    END IF;

    IF geom_filters IS NOT NULL AND numer_filters IS NOT NULL THEN
-      scores_clause := 'SELECT 1 score, null, geom_tid table_id, geom_id column_id,
-                               null, null, null, null, null, null
-                        FROM meta ';
+      scores_clause := 'scores AS (
+        SELECT NULL::INTEGER target_geoms, NULL::Numeric target_area,
+        1 score, null, geom_tid table_id, geom_id column_id,
+        NULL::Integer numgeoms
+        FROM meta) ';
    END IF;
  END IF;

@@ -156,7 +172,11 @@ BEGIN
        (unnest($3))->>'geom_id' geom_id,
        (unnest($3))->>'numer_timespan' numer_timespan,
        (unnest($3))->>'geom_timespan' geom_timespan,
-        (unnest($3))->>'normalization' normalization
+        (unnest($3))->>'normalization' normalization,
+        (unnest($3))->>'max_timespan_rank' max_timespan_rank,
+        (unnest($3))->>'max_score_rank' max_score_rank,
+        ((unnest($3))->>'target_geoms')::INTEGER target_geoms,
+        ((unnest($3))->>'target_area')::Numeric target_area
    ), meta AS (SELECT
        id,
        f.numer_id,
@@ -166,6 +186,8 @@ BEGIN
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE numer_tablename END numer_tablename,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE numer_type END numer_type,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE numer_name END numer_name,
+        CASE WHEN f.numer_id IS NULL THEN NULL ELSE numer_description END numer_description,
+        CASE WHEN f.numer_id IS NULL THEN NULL ELSE numer_t_description END numer_t_description,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE m.numer_timespan END numer_timespan,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE m.denom_id END denom_id,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_aggregate END denom_aggregate,
@@ -173,6 +195,8 @@ BEGIN
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_geomref_colname END denom_geomref_colname,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_tablename END denom_tablename,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_name END denom_name,
+        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_description END denom_description,
+        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_t_description END denom_t_description,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_type END denom_type,
        CASE WHEN f.numer_id IS NULL THEN NULL ELSE denom_reltype END denom_reltype,
        m.geom_id,
@@ -182,8 +206,14 @@ BEGIN
        geom_geomref_colname,
        geom_tablename,
        geom_name,
+        geom_description,
+        geom_t_description,
        geom_type,
-        normalization
+        normalization,
+        max_timespan_rank,
+        max_score_rank,
+        target_geoms,
+        target_area
      FROM observatory.obs_meta m JOIN _filters f
      ON CASE WHEN f.numer_id IS NULL THEN m.geom_id ELSE m.numer_id END =
         CASE WHEN f.numer_id IS NULL THEN f.geom_id ELSE f.numer_id END
@@ -194,9 +224,8 @@ BEGIN
        AND (m.geom_id = f.geom_id OR COALESCE(f.geom_id, '') = '')
        AND (m.geom_timespan = f.geom_timespan OR COALESCE(f.geom_timespan, '') = '')
        AND (m.numer_timespan = f.numer_timespan OR COALESCE(f.numer_timespan, '') = '')
-    ), scores AS (
-        %s
-    ), groups AS (SELECT
+    ), %s
+    , groups AS (SELECT
        id,
        scores.score,
        numer_timespan,
@@ -213,39 +242,46 @@ BEGIN
          'numer_geomref_colname', cdb_observatory.FIRST(meta.numer_geomref_colname),
          'numer_tablename', cdb_observatory.FIRST(meta.numer_tablename),
          'numer_type', cdb_observatory.FIRST(meta.numer_type),
-          --'numer_description', cdb_observatory.FIRST(meta.numer_description),
-          --'numer_t_description', cdb_observatory.FIRST(meta.numer_t_description),
+          'numer_description', cdb_observatory.FIRST(meta.numer_description),
+          'numer_t_description', cdb_observatory.FIRST(meta.numer_t_description),
          'denom_aggregate', cdb_observatory.FIRST(meta.denom_aggregate),
          'denom_colname', cdb_observatory.FIRST(denom_colname),
          'denom_geomref_colname', cdb_observatory.FIRST(denom_geomref_colname),
          'denom_tablename', cdb_observatory.FIRST(denom_tablename),
          'denom_type', cdb_observatory.FIRST(meta.denom_type),
          'denom_reltype', cdb_observatory.FIRST(meta.denom_reltype),
-          --'denom_description', cdb_observatory.FIRST(meta.denom_description),
-          --'denom_t_description', cdb_observatory.FIRST(meta.denom_t_description),
+          'denom_description', cdb_observatory.FIRST(meta.denom_description),
+          'denom_t_description', cdb_observatory.FIRST(meta.denom_t_description),
          'geom_colname', cdb_observatory.FIRST(geom_colname),
          'geom_geomref_colname', cdb_observatory.FIRST(geom_geomref_colname),
          'geom_tablename', cdb_observatory.FIRST(geom_tablename),
          'geom_type', cdb_observatory.FIRST(meta.geom_type),
          'geom_timespan', cdb_observatory.FIRST(meta.geom_timespan),
-          --'geom_description', cdb_observatory.FIRST(meta.geom_description),
-          --'geom_t_description', cdb_observatory.FIRST(meta.geom_t_description),
+          'geom_description', cdb_observatory.FIRST(meta.geom_description),
+          'geom_t_description', cdb_observatory.FIRST(meta.geom_t_description),
          'numer_timespan', cdb_observatory.FIRST(numer_timespan),
          'numer_name', cdb_observatory.FIRST(numer_name),
          'denom_name', cdb_observatory.FIRST(denom_name),
          'geom_name', cdb_observatory.FIRST(geom_name),
          'normalization', cdb_observatory.FIRST(normalization),
+          'max_timespan_rank', cdb_observatory.FIRST(max_timespan_rank),
+          'max_score_rank', cdb_observatory.FIRST(max_score_rank),
+          'target_geoms', cdb_observatory.FIRST(scores.target_geoms),
+          'target_area', cdb_observatory.FIRST(scores.target_area),
+          'num_geoms', cdb_observatory.FIRST(scores.numgeoms),
          'denom_id', denom_id,
          'geom_id', meta.geom_id
        ) metadata
      FROM meta, scores
      WHERE meta.geom_id = scores.column_id
        AND meta.geom_tid = scores.table_id
+        AND COALESCE(meta.target_geoms, 0) = COALESCE(scores.target_geoms, 0)
+        AND COALESCE(meta.target_area, 0) = COALESCE(scores.target_area, 0)
      GROUP BY id, score, numer_id, denom_id, geom_id, numer_timespan
    ) SELECT JSON_AGG(metadata ORDER BY id)
      FROM groups
-      WHERE timespan_rank <= $4
-        AND score_rank <= $5
+      WHERE timespan_rank <= Coalesce((metadata->>'max_timespan_rank')::INTEGER, $4)
+        AND score_rank <= Coalesce((metadata->>'max_score_rank')::INTEGER, $5)
  $string$, meta_filter_clause, scores_clause)
  INTO result
  USING
@@ -772,8 +808,8 @@ BEGIN
    RETURN QUERY EXECUTE format($query$
      WITH _raw_geoms AS (%s),
      _geoms AS (SELECT id,
-        CASE WHEN (ST_NPoints(geom) > 500)
-               THEN ST_CollectionExtract(ST_MakeValid(ST_SimplifyVW(geom, 0.0001)), 3)
+        CASE WHEN (ST_NPoints(geom) > 1000)
+               THEN ST_CollectionExtract(ST_MakeValid(ST_SimplifyVW(geom, 0.00001)), 3)
             ELSE geom END geom
        FROM _raw_geoms),
      _procgeoms AS (SELECT _geoms.id, _geoms.geom %s %s
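The final filter of the `OBS_GetMeta` query in this file now wraps each rank limit in `Coalesce(..., $4)` and `Coalesce(..., $5)`, so a per-item value wins and the global argument is the fallback. A rough Python model of that rule, assuming the documented global default of 1 for both limits (the function names are illustrative only):

```python
def effective_limit(item, key, global_limit):
    """Mimic Coalesce((metadata->>key)::INTEGER, global_limit)."""
    value = item.get(key)
    # ->> yields text in the SQL, so accept strings as well as ints here.
    return int(value) if value is not None else global_limit


def keep_row(item, timespan_rank, score_rank,
             max_timespan_rank=1, max_score_rank=1):
    """Return True when a candidate row passes both rank filters."""
    return (
        timespan_rank <= effective_limit(item, "max_timespan_rank", max_timespan_rank)
        and score_rank <= effective_limit(item, "max_score_rank", max_score_rank)
    )
```

With no overrides only the top-ranked row survives; an item carrying `"max_timespan_rank": "2"` also keeps the second most recent timespan.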
--- a/src/pg/sql/42_observatory_exploration.sql
+++ b/src/pg/sql/42_observatory_exploration.sql
@@ -418,7 +418,8 @@ $$ LANGUAGE plpgsql;
 CREATE OR REPLACE FUNCTION cdb_observatory._OBS_GetGeometryScores(
  bounds Geometry(Geometry, 4326) DEFAULT NULL,
  filter_geom_ids TEXT[] DEFAULT NULL,
-  desired_num_geoms INTEGER DEFAULT NULL
+  desired_num_geoms INTEGER DEFAULT NULL,
+  desired_area NUMERIC DEFAULT NULL
 ) RETURNS TABLE (
  score NUMERIC,
  numtiles BIGINT,
@@ -430,6 +431,8 @@ CREATE OR REPLACE FUNCTION cdb_observatory._OBS_GetGeometryScores(
  estnumgeoms NUMERIC,
  meanmediansize NUMERIC
 ) AS $$
+DECLARE
+  num_geoms_multiplier Numeric;
 BEGIN
  IF desired_num_geoms IS NULL THEN
    desired_num_geoms := 3000;
@@ -440,6 +443,18 @@ BEGIN
  IF ST_Npoints(bounds) > 10000 THEN
    bounds := ST_Envelope(bounds);
  END IF;
+  IF desired_area IS NULL THEN
+    desired_area := ST_Area(bounds);
+  END IF;
+
+  -- In case of points, desired_area will be 0.  We still want an accurate
+  -- estimate of numgeoms in that case.
+  IF desired_area = 0 THEN
+    num_geoms_multiplier := 1;
+  ELSE
+    num_geoms_multiplier := Coalesce(desired_area / Nullif(ST_Area(bounds), 0), 1);
+  END IF;
+
  RETURN QUERY
  EXECUTE $string$
    WITH clipped_geom AS (
@@ -453,13 +468,11 @@ BEGIN
    ), clipped_geom_countagg AS (
      SELECT column_id, table_id
        , BOOL_AND(ST_BandIsNoData(clipped_tile, 1)) nodata
-        , ST_CountAgg(clipped_tile, 1, False)::Numeric pixels -- -10
      FROM clipped_geom
      GROUP BY column_id, table_id
    ), clipped_geom_reagg AS (
      SELECT COUNT(*)::BIGINT cnt, a.column_id, a.table_id,
             cdb_observatory.FIRST(nodata) first_nodata,
-             cdb_observatory.FIRST(pixels) first_pixel,
             cdb_observatory.FIRST(tile) first_tile,
             (ST_SummaryStatsAgg(clipped_tile, 1, False)).sum::Numeric sum_geoms, -- ND
             (ST_SummaryStatsAgg(clipped_tile, 2, False)).mean::Numeric / 255 mean_fill --ND
@@ -474,9 +487,8 @@ BEGIN
        , (CASE WHEN first_nodata IS FALSE
                THEN sum_geoms
                ELSE COALESCE(ST_Value(first_tile, 1, ST_PointOnSurface($1)), 0)
-                  * (ST_Area($1) / ST_Area(ST_PixelAsPolygon(first_tile, 0, 0))
-                    * first_pixel) -- -20
-          END)::Numeric
+                  * (ST_Area($1) / ST_Area(ST_PixelAsPolygon(first_tile, 0, 0)))
+          END)::Numeric * $4
        AS numgeoms
        , (CASE WHEN first_nodata IS FALSE
                THEN mean_fill
@@ -490,7 +502,7 @@ BEGIN
      ((100.0 / (1+abs(log(0.0001 + $3) - log(0.0001 + numgeoms::Numeric)))) * percentfill)::Numeric
      AS score, *
      FROM final
-  $string$ USING bounds, filter_geom_ids, desired_num_geoms;
+  $string$ USING bounds, filter_geom_ids, desired_num_geoms, num_geoms_multiplier;
  RETURN;
 END
 $$ LANGUAGE plpgsql IMMUTABLE;
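The new `desired_area` handling and the existing score expression can be sketched outside the database as follows. This is an approximation for illustration, not the extension's code: areas are plain floats rather than `ST_Area` results, `percentfill` is taken as an input, and `math.log10` stands in for PostgreSQL's `log`, which is base 10.

```python
import math


def num_geoms_multiplier(desired_area, bounds_area):
    """Scale factor applied to the estimated number of geometries."""
    if desired_area is None:
        desired_area = bounds_area  # default to the area of the bounds
    if desired_area == 0:
        return 1.0  # points: keep the unscaled estimate
    if bounds_area == 0:
        return 1.0  # mirrors Coalesce(x / Nullif(area, 0), 1)
    return desired_area / bounds_area


def score(desired_num_geoms, numgeoms, percentfill):
    """Closeness of numgeoms to the target on a log scale, times fill."""
    distance = abs(math.log10(0.0001 + desired_num_geoms)
                   - math.log10(0.0001 + numgeoms))
    return (100.0 / (1 + distance)) * percentfill
```

A boundary whose estimated count equals the target scores `100 * percentfill`, and each tenfold gap between estimate and target adds 1 to the denominator, so the score falls off with the order-of-magnitude difference.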
--- a/src/pg/test/expected/41_observatory_augmentation_test.out
+++ b/src/pg/test/expected/41_observatory_augmentation_test.out
@@ -261,3 +261,31 @@ t|t
 ary_type|obs_getdata_api_geomrefs_args_string_return
 t|t
 (1 row)
+setseed
+
+(1 row)
+bg_sample|bg_max_error|bg_avg_error|bg_min_error
+1|t|t|t
+2|t|t|t
+3|t|t|t
+5|t|t|t
+10|t|t|t
+25|t|t|t
+50|t|t|t
+100|t|t|t
+2085|t|t|t
+(9 rows)
+tract_sample|tract_max_error|tract_avg_error|tract_min_error
+1|t|t|t
+2|t|t|t
+3|t|t|t
+5|t|t|t
+10|t|t|t
+25|t|t|t
+50|t|t|t
+100|t|t|t
+761|t|t|t
+(9 rows)
+no_bg_point_error
+t
+(1 row)
--- a/src/pg/test/expected/42_observatory_exploration_test.out
+++ b/src/pg/test/expected/42_observatory_exploration_test.out
@@ -159,21 +159,36 @@ t
 _obs_geometryscores_2500km_buffer
 t
 (1 row)
-_obs_geometryscores_numgeoms_500m_buffer
-t
-(1 row)
-_obs_geometryscores_numgeoms_5km_buffer
-t
-(1 row)
-_obs_geometryscores_numgeoms_50km_buffer
-t
-(1 row)
-_obs_geometryscores_numgeoms_500km_buffer
-t
-(1 row)
-_obs_geometryscores_numgeoms_2500km_buffer
-t
-(1 row)
+column_id|_obs_geometryscores_numgeoms_500m_buffer
+us.census.tiger.block_group|2
+us.census.tiger.census_tract|1
+us.census.tiger.zcta5|0
+us.census.tiger.county|0
+(4 rows)
+column_id|_obs_geometryscores_numgeoms_5km_buffer
+us.census.tiger.block_group|244
+us.census.tiger.census_tract|78
+us.census.tiger.zcta5|9
+us.census.tiger.county|0
+(4 rows)
+column_id|_obs_geometryscores_numgeoms_50km_buffer
+us.census.tiger.block_group|10817
+us.census.tiger.census_tract|3396
+us.census.tiger.zcta5|484
+us.census.tiger.county|11
+(4 rows)
+column_id|_obs_geometryscores_numgeoms_500km_buffer
+us.census.tiger.block_group|48567
+us.census.tiger.census_tract|15823
+us.census.tiger.zcta5|6466
+us.census.tiger.county|295
+(4 rows)
+column_id|_obs_geometryscores_numgeoms_2500km_buffer
+us.census.tiger.block_group|165852
+us.census.tiger.census_tract|55283
+us.census.tiger.zcta5|27046
+us.census.tiger.county|2551
+(4 rows)
 _obs_geometryscores_500km_buffer_50_geoms
 t
 (1 row)
@@ -186,6 +201,12 @@ t
 _obs_geometryscores_500km_buffer_25000_geoms
 t
 (1 row)
+testarea_uses_tract
+t
+(1 row)
+points_use_bg
+t
+(1 row)
 _total_pop_in_legacy_builder_metadata
 t
 (1 row)
--- a/src/pg/test/fixtures/load_fixtures.sql
+++ b/src/pg/test/fixtures/load_fixtures.sql
--- a/src/pg/test/sql/41_observatory_augmentation_test.sql
+++ b/src/pg/test/sql/41_observatory_augmentation_test.sql
@@ -798,3 +798,146 @@ SELECT json_typeof(data->0->'value') = 'array' ary_type,
 AS OBS_GetData_API_geomrefs_args_string_return
 FROM cdb_observatory.obs_getdata(array['36047'],
      '[{"numer_type": "text", "numer_colname": "obs_getboundarybyid", "api_method": "obs_getboundarybyid", "api_args": ["us.census.tiger.county"]}]');
+
+-- Ensure consistent results below.
+select setseed(0);
+
+-- Check that random assortment of block groups in Brooklyn return accurate data
+WITH _geoms AS (
+  SELECT
+    (data->0->>'value')::geometry the_geom,
+    data->0->>'geomref' geom_ref,
+    (data->1->>'value')::numeric total_pop
+  FROM cdb_observatory.OBS_GetData(
+    array[(st_buffer(cdb_observatory._testpoint(), 0.2), 1)::geomval],
+    (SELECT cdb_observatory.OBS_GetMeta(ST_MakeEnvelope(-179, 89, 179, -89, 4326),
+      '[{"geom_id": "us.census.tiger.block_group"},
+        {"numer_id": "us.census.acs.B01003001", "geom_id": "us.census.tiger.block_group", "normalization": "predenom"}]')),
+    FALSE
+  )
+  WHERE data->0->>'geomref' LIKE '36047%'
+  ORDER BY RANDOM()
+), geoms AS (
+  SELECT *, row_number() OVER () cartodb_id FROM _geoms
+), samples AS (
+  SELECT COUNT(*) cnt, unnest(ARRAY[1, 2, 3, 5, 10, 25, 50, 100, COUNT(*)]) sample FROM geoms
+), filtered AS (
+  SELECT * FROM geoms, samples WHERE cartodb_id % (cnt / sample) = 0
+), summary AS (
+  SELECT sample, ST_SetSRID(ST_Extent(the_geom), 4326) extent,
+    COUNT(*)::INT cnt,
+    ARRAY_AGG((the_geom, cartodb_id)::geomval) geomvals,
+    SUM(ST_Area(the_geom))::Numeric sumarea
+  FROM filtered
+  GROUP BY sample
+), meta AS (
+  SELECT sample, cdb_observatory.OBS_GetMeta(extent,
+    ('[{"numer_id": "us.census.acs.B01003001", "normalization": "predenom", "target_area": ' || sumarea || '}]')::JSON,
+    1, 1, cnt) meta
+  FROM summary
+  GROUP BY sample, extent, cnt, sumarea
+), results AS (
+  SELECT summary.sample, id, meta->0->>'geom_id' geom_id, (data->0->>'value')::Numeric as val
+  FROM summary, meta, LATERAL cdb_observatory.OBS_GetData(geomvals, meta) data
+  WHERE summary.sample = meta.sample
+) SELECT sample bg_sample
+ , MAX(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 bg_max_error
+ , AVG(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 bg_avg_error
+ , MIN(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 bg_min_error
+FROM geoms, results
+WHERE cartodb_id = id
+GROUP BY sample
+ORDER BY sample
+;
+
+-- Check that random assortment of tracts in Brooklyn return accurate data
+WITH _geoms AS (
+  SELECT
+    (data->0->>'value')::geometry the_geom,
+    data->0->>'geomref' geom_ref,
+    (data->1->>'value')::numeric total_pop
+  FROM cdb_observatory.OBS_GetData(
+    array[(st_buffer(cdb_observatory._testpoint(), 0.2), 1)::geomval],
+    (SELECT cdb_observatory.OBS_GetMeta(ST_MakeEnvelope(-179, 89, 179, -89, 4326),
+      '[{"geom_id": "us.census.tiger.census_tract"},
+        {"numer_id": "us.census.acs.B01003001", "geom_id": "us.census.tiger.census_tract", "normalization": "predenom"}]')),
+    FALSE
+  )
+  WHERE data->0->>'geomref' LIKE '36047%'
+  ORDER BY RANDOM()
+), geoms AS (
+  SELECT *, row_number() OVER () cartodb_id FROM _geoms
+), samples AS (
+  SELECT COUNT(*) cnt, unnest(ARRAY[1, 2, 3, 5, 10, 25, 50, 100, COUNT(*)]) sample FROM geoms
+), filtered AS (
+  SELECT * FROM geoms, samples WHERE cartodb_id % (cnt / sample) = 0
+), summary AS (
+  SELECT sample, ST_SetSRID(ST_Extent(the_geom), 4326) extent,
+    COUNT(*)::INT cnt,
+    ARRAY_AGG((the_geom, cartodb_id)::geomval) geomvals,
+    SUM(ST_Area(the_geom))::Numeric sumarea
+  FROM filtered
+  GROUP BY sample
+), meta AS (
+  SELECT sample, cdb_observatory.OBS_GetMeta(extent,
+    ('[{"numer_id": "us.census.acs.B01003001", "normalization": "predenom", "target_area": ' || sumarea || '}]')::JSON,
+    1, 1, cnt) meta
+  FROM summary
+  GROUP BY sample, extent, cnt, sumarea
+), results AS (
+  SELECT summary.sample, id, meta->0->>'geom_id' geom_id, (data->0->>'value')::Numeric as val
+  FROM summary, meta, LATERAL cdb_observatory.OBS_GetData(geomvals, meta) data
+  WHERE summary.sample = meta.sample
+) SELECT sample tract_sample
+ , MAX(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 tract_max_error
+ , AVG(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 tract_avg_error
+ , MIN(100 * abs((geoms.total_pop - val) / Coalesce(NullIf(total_pop, 0), NULL)))::Numeric(10, 2) < 10 tract_min_error
+FROM geoms, results
+WHERE cartodb_id = id
+GROUP BY sample
+ORDER BY sample
+;
+
+-- Check that random assortment of block group points in Brooklyn return accurate data
+WITH _geoms AS (
+  SELECT
+    ST_PointOnSurface((data->0->>'value')::geometry) the_geom,
+    data->0->>'geomref' geom_ref,
+    (data->1->>'value')::numeric total_pop
+  FROM cdb_observatory.OBS_GetData(
+    array[(st_buffer(cdb_observatory._testpoint(), 0.2), 1)::geomval],
+    (SELECT cdb_observatory.OBS_GetMeta(ST_MakeEnvelope(-179, 89, 179, -89, 4326),
+      '[{"geom_id": "us.census.tiger.block_group"},
+        {"numer_id": "us.census.acs.B01003001", "geom_id": "us.census.tiger.block_group", "normalization": "predenom"}]')),
+    FALSE
+  )
+  WHERE data->0->>'geomref' LIKE '36047%'
+), geoms AS (
+  SELECT *, row_number() OVER () cartodb_id FROM _geoms
+), samples AS (
+  SELECT COUNT(*) cnt, unnest(ARRAY[1, 2, 3, 5, 10, 25, 50, 100, COUNT(*)]) sample FROM geoms
+), filtered AS (
+  SELECT * FROM geoms, samples WHERE cartodb_id % (cnt / sample) = 0
+), summary AS (
+  SELECT sample, ST_SetSRID(ST_Extent(the_geom), 4326) extent,
+    COUNT(*)::INT cnt,
+    ARRAY_AGG((the_geom, cartodb_id)::geomval) geomvals,
+    SUM(ST_Area(the_geom))::Numeric sumarea
+  FROM filtered
+  GROUP BY sample
+), meta AS (
+  SELECT sample, cdb_observatory.OBS_GetMeta(extent,
+    ('[{"numer_id": "us.census.acs.B01003001", "normalization": "predenom", "target_area": ' || sumarea || '}]')::JSON,
+    1, 1, cnt) meta
+  FROM summary
+  GROUP BY sample, extent, cnt, sumarea
+), results AS (
+  SELECT summary.sample, id, meta->0->>'geom_id' geom_id, (data->0->>'value')::Numeric as val
+  FROM summary, meta, LATERAL cdb_observatory.OBS_GetData(geomvals, meta) data
+  WHERE summary.sample = meta.sample
+) SELECT
+ BOOL_AND(abs((geoms.total_pop - val) /
+      Coalesce(NullIf(total_pop, 0), 1)) = 0) is True no_bg_point_error
+FROM geoms, results
+WHERE cartodb_id = id
+;
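The block group and tract tests above assert that the percent error between the known population and the value returned through automatic boundary selection stays below 10. A small sketch of that metric, under the assumption that rows with a zero population are skipped (the SQL yields NULL for them via `NullIf`, and aggregates ignore NULLs); the function names are illustrative:

```python
def percent_error(expected, actual):
    """Percent error relative to expected, or None when expected is 0."""
    if expected == 0:
        return None
    return 100 * abs((expected - actual) / expected)


def max_error_below(pairs, threshold=10):
    """True when the largest defined percent error is under threshold."""
    errors = [percent_error(e, a) for e, a in pairs]
    defined = [err for err in errors if err is not None]
    return max(defined) < threshold if defined else None
```

The point test is stricter: it requires an error of exactly zero for every sampled block group.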
--- a/src/pg/test/sql/42_observatory_exploration_test.sql
+++ b/src/pg/test/sql/42_observatory_exploration_test.sql
@@ -360,9 +360,9 @@ SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
        'us.census.tiger.county', 'us.census.tiger.zcta5'])
      WHERE table_id LIKE '%2015%';

-SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
-       ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
-             'us.census.tiger.zcta5', 'us.census.tiger.county']
+SELECT ARRAY_AGG(column_id ORDER BY score DESC)
+       = ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
+             'us.census.tiger.county', 'us.census.tiger.zcta5']
       AS _obs_geometryscores_5km_buffer
       FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 5000)::Geometry(Geometry, 4326),
@@ -390,60 +390,55 @@ SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
      WHERE table_id LIKE '%2015%';

-SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
-       ARRAY['us.census.tiger.county', 'us.census.tiger.zcta5',
-             'us.census.tiger.census_tract', 'us.census.tiger.block_group']
+SELECT ARRAY_AGG(column_id ORDER BY score DESC)
+       = ARRAY['us.census.tiger.county', 'us.census.tiger.census_tract',
+             'us.census.tiger.zcta5', 'us.census.tiger.block_group']
      AS _obs_geometryscores_2500km_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 2500000)::Geometry(Geometry, 4326),
-  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
-        'us.census.tiger.zcta5', 'us.census.tiger.county'])
+  ARRAY['us.census.tiger.county', 'us.census.tiger.census_tract',
+        'us.census.tiger.zcta5', 'us.census.tiger.block_group'])
      WHERE table_id LIKE '%2015%';

-SELECT JSON_Object_Agg(column_id, numgeoms::int ORDER BY numgeoms DESC)::Text
-      = '{ "us.census.tiger.block_group" : 9, "us.census.tiger.census_tract" : 3, "us.census.tiger.zcta5" : 0, "us.census.tiger.county" : 0 }'
-      AS _obs_geometryscores_numgeoms_500m_buffer
+SELECT column_id, numgeoms::int AS _obs_geometryscores_numgeoms_500m_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 500)::Geometry(Geometry, 4326),
  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
-      WHERE table_id LIKE '%2015%';
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC;

-SELECT JSON_Object_Agg(column_id, numgeoms::int ORDER BY numgeoms DESC)::Text =
-      '{ "us.census.tiger.block_group" : 880, "us.census.tiger.census_tract" : 310, "us.census.tiger.zcta5" : 45, "us.census.tiger.county" : 1 }'
-      AS _obs_geometryscores_numgeoms_5km_buffer
+SELECT column_id, numgeoms::int AS _obs_geometryscores_numgeoms_5km_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 5000)::Geometry(Geometry, 4326),
  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
-      WHERE table_id LIKE '%2015%';
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC;

-SELECT JSON_Object_Agg(column_id, numgeoms::int ORDER BY numgeoms DESC)::Text =
-      '{ "us.census.tiger.block_group" : 11531, "us.census.tiger.census_tract" : 3601, "us.census.tiger.zcta5" : 550, "us.census.tiger.county" : 14 }'
-      AS _obs_geometryscores_numgeoms_50km_buffer
+SELECT column_id, numgeoms::int AS _obs_geometryscores_numgeoms_50km_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 50000)::Geometry(Geometry, 4326),
  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
-      WHERE table_id LIKE '%2015%';
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC;

-SELECT JSON_Object_Agg(column_id, numgeoms::int ORDER BY numgeoms DESC)::Text =
-      '{ "us.census.tiger.block_group" : 48917, "us.census.tiger.census_tract" : 15969, "us.census.tiger.zcta5" : 6534, "us.census.tiger.county" : 314 }'
-      AS _obs_geometryscores_numgeoms_500km_buffer
+SELECT column_id, numgeoms::int AS _obs_geometryscores_numgeoms_500km_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 500000)::Geometry(Geometry, 4326),
  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
-      WHERE table_id LIKE '%2015%';
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC;

-SELECT JSON_Object_Agg(column_id, numgeoms::int ORDER BY numgeoms DESC)::Text =
-      '{ "us.census.tiger.block_group" : 169191, "us.census.tiger.census_tract" : 56469, "us.census.tiger.zcta5" : 26525, "us.census.tiger.county" : 2753 }'
-      AS _obs_geometryscores_numgeoms_2500km_buffer
+SELECT column_id, numgeoms::int AS _obs_geometryscores_numgeoms_2500km_buffer
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 2500000)::Geometry(Geometry, 4326),
  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
        'us.census.tiger.zcta5', 'us.census.tiger.county'])
-      WHERE table_id LIKE '%2015%';
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC;

 SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
       ARRAY['us.census.tiger.county', 'us.census.tiger.zcta5',
@@ -475,9 +470,9 @@ SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
        'us.census.tiger.zcta5', 'us.census.tiger.county'], 2500)
      WHERE table_id LIKE '%2015%';

-SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
-       ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
-             'us.census.tiger.zcta5', 'us.census.tiger.county']
+SELECT ARRAY_AGG(column_id ORDER BY score DESC)
+       = ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
+               'us.census.tiger.county', 'us.census.tiger.zcta5']
      AS _obs_geometryscores_500km_buffer_25000_geoms
      FROM cdb_observatory._OBS_GetGeometryScores(
  ST_Buffer(ST_SetSRID(ST_MakePoint(-73.9, 40.7), 4326)::Geography, 50000)::Geometry(Geometry, 4326),
@@ -485,6 +480,44 @@ SELECT ARRAY_AGG(column_id ORDER BY score DESC) =
        'us.census.tiger.zcta5', 'us.census.tiger.county'], 25000)
      WHERE table_id LIKE '%2015%';

+-- Check that one small geom approximates tract data
+WITH geoms AS (SELECT cdb_observatory._testarea() the_geom),
+summary AS (SELECT ST_SetSRID(ST_Extent(the_geom), 4326) extent,
+                   COUNT(*)::INT cnt,
+                   SUM(ST_Area(the_geom))::Numeric sumarea
+            FROM geoms)
+SELECT column_id = 'us.census.tiger.census_tract' testarea_uses_tract
+FROM summary, LATERAL (
+  SELECT *
+  FROM cdb_observatory._OBS_GetGeometryScores(extent,
+  ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
+        'us.census.tiger.zcta5', 'us.census.tiger.county'],
+        cnt, sumarea)) foo
+ORDER BY score DESC LIMIT 1;
+
+-- Check that randomly distributed points always use smallest geometry if we
+-- order by numgeoms desc
+WITH geoms as (SELECT UNNEST(ARRAY[
+    cdb_observatory._testpoint(),
+    st_translate(cdb_observatory._testpoint(), -0.003, 0),
+    st_translate(cdb_observatory._testpoint(), -0.006, 0)
+]) the_geom),
+summary as (SELECT
+  ST_SetSRID(ST_Extent(the_geom), 4326) extent,
+  SUM(ST_Area(the_geom))::Numeric area,
+  COUNT(*)::INTEGER cnt
+  FROM geoms
+)
+SELECT column_id = 'us.census.tiger.block_group' points_use_bg
+      FROM summary, LATERAL (
+        SELECT * FROM cdb_observatory._OBS_GetGeometryScores(
+          extent,
+          ARRAY['us.census.tiger.block_group', 'us.census.tiger.census_tract',
+                'us.census.tiger.zcta5', 'us.census.tiger.county'],
+        cnt, area)) foo
+      WHERE table_id LIKE '%2015%'
+      ORDER BY numgeoms DESC LIMIT 1;
+
 --
 -- OBS_LegacyBuilderMetadata tests
 --
Author	SHA1	Message	Date
John Krauss	536af5e4a2	release artifact	2017-03-22 15:17:19 +00:00
John Krauss	ebf23d2a23	Merge branch 'develop' into release-v-1.4.0	2017-03-22 15:16:35 +00:00
John Krauss	f1afcf0d8e	update NEWS.md	2017-03-22 15:14:35 +00:00
John Krauss	3c0b40cf3f	more consistent arguments in docs	2017-03-22 15:12:50 +00:00
John Krauss	8a87dc7e9a	update NEWS.md	2017-03-21 21:24:50 +00:00
John Krauss	61552adba4	Allow for target_geoms and target_area override on column-by-column basis	2017-03-21 17:26:02 +00:00
csobier	36abbee64f	Merge pull request #274 from CartoDB/273-docs-edit: clarification of docs for obs_getboundariesbygeometry function	2017-03-17 12:07:48 -04:00
csobier	5a76a7381e	clarification of docs for obs_getboundariesbygeometry function	2017-03-17 11:45:49 -04:00
John Krauss	217ca2d84d	release 1.3.5 artifact	2017-03-15 20:12:06 +00:00