Address Parsing

OpenCorporates now provides parsed-address data for all US state jurisdictions as an add-on to your bulk data deliveries, giving you access to addresses in a consistent, machine-readable format, with components that can be formatted to suit your needs.

What is an Address?

An address is information that represents a real-world location. It is most often presented as a free-form text string that is easily (usually!) interpretable by a human reader. However, the open, free-form nature of address data can make it very difficult for machines to index or query effectively using standard approaches. Thus, if you run a business that leverages address data in some way, you will have to account for the inherent structure of addresses. This naturally leads to the notion of address parsing.

If we assume that an address is actually composed of distinct attributes, we can try to break it down into these components to make it machine readable and indexable. This breaking down into components is what parsing means; such components could be House Number, Country or Postal Code, for example. Parsing in this way would allow much easier querying and comparison of address data. However, doing this reliably, accurately and at scale is very challenging for a number of reasons.
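To make the idea concrete, here is a deliberately naive sketch (not our production parser) that splits a simple comma-separated US-style address into labelled components. The function name and component keys are illustrative only:

```python
def naive_parse(address: str) -> dict:
    """Naively split 'street, locality, region postal_code' into components.

    Illustrative only: assumes a fixed comma-separated layout, which real
    addresses frequently violate (see the challenges listed below).
    """
    parts = [p.strip() for p in address.split(",")]
    # Split the final segment into region and postal code on the last space.
    region, _, postal_code = parts[2].rpartition(" ")
    return {
        "street_address": parts[0],
        "locality": parts[1],
        "region": region,
        "postal_code": postal_code,
    }

components = naive_parse("970 W. Broadway, Jackson, WY 83001")
```

Even this tiny example shows why fixed rules break down quickly: a missing comma, an extra unit number or a different country convention would defeat it immediately.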

Challenges

  • Local Context – Addresses are written slightly differently all over the world; the conventions depend on the location and its local context, especially the formation and evolution of the local postal service.
  • Variety of Conventions – There are multiple ways to write the same address: ‘W 110 St, NY, NY, US’ is equivalent to ‘West 110th Street, New York, NY, USA’. Which one is more correct if they convey the same location? Add in PO box numbers, house names, house numbers, flat numbers, unit numbers, staircase vs elevation level, access points vs entrance names, suburbs or city districts, and you have an incredible amount of variation to generalise a solution over!
  • Languages – Address information will be in the language of the country of origin, and street names and house names tend to reflect that language most clearly. Developing a solution that works across a large number of languages is also difficult.
  • Human Errata – Data collected from humans will contain mistakes and erroneous information throughout. This could be typos, duplicates or simply incorrect information. Any parser needs to handle this as best it can.
  • Noise – Raw address data will also contain noise that makes machine readability harder. A common example is escape or newline characters (‘\n’) in UK addresses, where each address component is conventionally written on its own line.
  • Data Volume – Building a trained machine learning address parser that generalises well over many different address types, in many different languages, requires a huge amount of data (e.g. hundreds of millions of labelled examples). Thankfully, there are open source solutions available that have gone down this route already. On the other hand, it can be hard to process and properly evaluate large volumes of address information at scale.
  • Interpretation – Evaluating how ‘correct’ one parsed address component is relative to another is not always straightforward. This is mostly a problem for the street address attribute, as there are multiple sub-components and variations that appear there while still conveying the same physical location perfectly well.
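The "variety of conventions" challenge can be illustrated with a toy normaliser that expands a few common US abbreviations so two spellings of the same street compare equal. The abbreviation table here is a hypothetical three-entry subset; a real solution needs thousands of locale-specific rules:

```python
# Illustrative subset only; real normalisation requires per-locale rule sets.
ABBREVIATIONS = {"w": "west", "st": "street", "ny": "new york"}

def normalise(address: str) -> str:
    """Lowercase, drop commas and trailing dots, and expand known abbreviations."""
    tokens = address.lower().replace(",", " ").split()
    expanded = [ABBREVIATIONS.get(t.rstrip("."), t.rstrip(".")) for t in tokens]
    return " ".join(expanded)

# Two spellings of the same street now normalise to the same string.
a = normalise("W. Broadway St")
b = normalise("West Broadway Street")
```

Even so, variants like ‘110’ vs ‘110th’ would still not match, which is exactly why rule-based normalisation alone does not scale and trained models become attractive.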

OC Address Data

OpenCorporates has an excellent data asset in address-level information for companies and officers. Often, though, it is still in its raw free-form string format and not split into distinct components, which makes it hard to index or query effectively. This happens because the registries we pull the data from often don’t provide explicit component fields (e.g. postcode, street) but instead provide only the full address. While we strive to provide this information in its rawest, truest form as shown at source, that can push unwanted data processing problems downstream onto our customers. To remedy this and make address-level information more accessible for all, we have taken steps to clean and parse the raw address data internally. This means we provide new, additional address attributes, derived from the source data, alongside the existing raw registry information, so clients can still trust our data provenance while also gaining clean, parsed address components. The next section describes how we tackle the problem and outlines our current solution.

Address Parsing Model Approach

We developed a straightforward approach that takes a raw free-form address string as input and returns a dictionary of parsed address components, where the dictionary conforms to our address mapping scheme.

Our approach can be described as the following multi-stage process:

  1. Preprocess Stage – clean and normalise the input string to improve machine readability and help any further processing downstream
    1. Remove large whitespace blocks
    2. Strip single and double newline characters
    3. Remove errant commas
    4. Remove adjacent exact duplicate tokens
  2. ML Parser Model – apply the trained machine learning address parsing library libpostal to detect distinct address components in the cleaned, full-form address
  3. Check Output Components – review the libpostal output and adjust / remap components as required
    1. Check country codes have been captured correctly (e.g. USA or United States → US)
    2. Query the country code table offline and fill in blanks
    3. Remap and clean the parser output as desired
Figure 1: Simple flow chart to illustrate stages of our approach
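The preprocess and country-code steps above can be sketched as follows. This is a minimal illustration under the assumption that each sub-step maps to a simple string transform; the function names and the three-entry country table are ours, not the production implementation, and the actual parsing step (libpostal's parse_address) is elided:

```python
import re

def preprocess(raw: str) -> str:
    """Stage 1 sketch: clean and normalise a raw address string."""
    # Strip literal '\n' markers and real newlines (common in UK-style data).
    s = raw.replace("\\n", ", ").replace("\n", ", ")
    # Collapse large whitespace blocks to a single space.
    s = re.sub(r"\s{2,}", " ", s)
    # Collapse errant / repeated commas into a single ', '.
    s = re.sub(r"(,\s*)+", ", ", s)
    # Remove adjacent exact duplicate tokens (ignoring trailing commas).
    tokens = s.split()
    deduped = [
        t for i, t in enumerate(tokens)
        if i == 0 or t.strip(",") != tokens[i - 1].strip(",")
    ]
    return " ".join(deduped).strip(", ")

# Stage 3a sketch: illustrative subset of a country remapping table.
COUNTRY_REMAP = {"usa": "US", "united states": "US"}

def remap_country(token: str) -> str:
    """Normalise common country spellings to an ISO alpha-2 code."""
    return COUNTRY_REMAP.get(token.strip().lower(), token)
```

For example, `preprocess("970 W. Broadway\\n#422\\nJackson,  WY 83001\\nUSA, USA")` yields a single clean comma-separated line with the duplicated country token removed, ready for the ML parser stage.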

Output Fields

We have two main vehicles for delivering parsed address information – the API and bulk files. The attribute naming convention only serves as an easy way to identify data-science-derived fields; it does not represent a finalised schema, and subsequent work will enable re-mappings to desired address schemes (such as the United States Postal Service scheme, for instance). For both delivery methods, new parsed attributes are prefixed with oc_ to make clear that they are derived from the raw input address information.

Bulk Files & External API: we always create new fields instead of transforming the source in situ, and append these as new columns to the existing bulk files. We take the address.in_full field as the input string and generate a new, clean string, which we then process with a trained machine learning parser. The output fields are as follows:

  • oc_clean_address_string – preprocessed version of the raw input address string
  • oc_street_address – street-level address, usually composed of a house name or number and a road or street name. There is a lot of variation in this field, as it has many potential sub-components
  • oc_high_level_street_address – same as the street-level address but with less granularity; usually contains only the street or road name
  • oc_locality – the locality is often the name of a human settlement, such as a village, town or city
  • oc_region – the region is a larger unit of state-level administration, such as a borough, county, district or even a state
  • oc_postal_code – standard postcode
  • oc_country2 – two-letter country code conforming to the ISO 3166-1 alpha-2 standard
  • oc_country3 – three-letter country code conforming to the ISO 3166-1 alpha-3 standard
  • oc_country_full – full country name
  • oc_provenance_metadata – an attribute showing which model version processed the data and at what time. Represented as <model-tag>M:<timestamp>, i.e. the model tag followed by an ISO timestamp
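The provenance format described above can be built and decomposed with a few lines of Python. This is a sketch of the stated <model-tag>M:<timestamp> convention; the helper names are ours:

```python
from datetime import datetime, timezone

def make_provenance(model_tag: str) -> str:
    """Join a model tag to a UTC ISO timestamp with the 'M:' separator."""
    return f"{model_tag}M:{datetime.now(timezone.utc).isoformat()}"

def split_provenance(value: str) -> tuple:
    """Recover (model_tag, timestamp) from a provenance string."""
    tag, _, ts = value.partition("M:")
    return tag, ts
```

Applied to the example value shown in the next section, `split_provenance("1.0.2M:2023-09-20T12:56:35.228857+00:00")` separates the model tag `1.0.2` from the processing timestamp.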

Parsed Example

Let’s take a real world address from one of our registry datasets and parse it using our internal API endpoint for OC address parsing. The response dictionary is provided below to illustrate expected output:

{
  "input": "970 W. Broadway\\\\n#422\\\\nJackson, WY 83001\\\\nUSA, United States",
  "oc_address_components": {
    "oc_country2": "US",
    "oc_country3": "USA",
    "oc_country_full": "United States",
    "oc_high_level_street_address": "970 W. Broadway #422",
    "oc_locality": "Jackson",
    "oc_postal_code": "83001",
    "oc_provenance_metadata": "1.0.2M:2023-09-20T12:56:35.228857+00:00",
    "oc_region": "WY",
    "oc_street_address": "970 W. Broadway #422"
  },
  "oc_clean_address_string": "970 W. Broadway, #422, Jackson, WY 83001, USA, United States"
}

General Notes & Observations

The solution is currently only applicable to US-based jurisdictions, where it demonstrates highly accurate results in our internal experiments. We will continue to develop the parsing solution and integrate improvements into our products over time.

Updated on December 14, 2023
