Additional Notes:
| File / Folder naming | – Each delivery will be in a folder named for its “as-at” date, e.g. “2016-06-01” – Sample / test files will be in separate dated folders, e.g. “sample_2016_06_01” |
| Compression | – Files will be compressed in gzip format, with “.gz” extension |
| md5sum | – A file named “md5sum” will be included, listing MD5 checksums for all gzipped files; this helps detect integrity issues during download |
| File transfer mechanism | – SFTP – User authentication: choice of password or private-key certificate |
| Format of files | – CSV, conformant to RFC 4180 https://tools.ietf.org/html/rfc4180 – Line breaks within text are indicated with \n (common in address fields) – Unix line endings |
| Column separator | , (comma) |
| Fields enclosed by | Optionally enclosed by double quotes, e.g. when a field contains a comma: xxx,"Test Company, LLC",yyy |
| Escape character | Double quotes in text are escaped by doubling them, as per RFC 4180, e.g.: xxx,"test ""hello"" company llc",yyy |
| Encoding | UTF-8 |
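
The delivery notes above can be exercised with a short Python sketch that checks the downloaded files against the “md5sum” file and then reads a gzipped CSV. It rests on assumptions the notes do not confirm: the “md5sum” file is assumed to follow the usual `md5sum` tool layout (hex digest, whitespace, file name per line), and the function names `md5_of_file`, `verify_delivery` and `read_rows` are hypothetical helpers, not part of any supplied tooling.

```python
import csv
import gzip
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_delivery(folder, manifest_name="md5sum"):
    """Return the names of files whose MD5 does not match the manifest.

    Assumption: each manifest line looks like "<hex digest>  <file name>".
    """
    folder = Path(folder)
    failures = []
    manifest = (folder / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading "*"
        if md5_of_file(folder / name) != expected.lower():
            failures.append(name)
    return failures


def read_rows(path):
    """Yield rows (lists of strings) from a gzipped, UTF-8, RFC 4180 CSV file."""
    # newline="" leaves line-break handling inside quoted fields to the csv module.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter=",", quotechar='"', doublequote=True)
```

Checking the checksums before parsing keeps a truncated download from surfacing later as a confusing CSV error. The sketch does not post-process `\n` markers inside text fields, since the notes do not say whether they arrive as real line breaks or as the two literal characters.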
Data Types
| varchar(255) | string up to 255 chars in length |
| text | string up to 65,535 bytes in length – NB: UTF8 consumes 1-3 bytes per character |
| mediumtext | string up to 16 MB in length – NB: UTF8 consumes 1-3 bytes per character |
| longtext | string up to 4 GB in length – NB: UTF8 consumes 1-3 bytes per character |
| pipe separated fields | Field length cannot be determined. Data in this field can consist of 0, 1 or multiple instances of an underlying field, so data may need to be normalised during ETL |
| boolean | contains: true, false, or is left blank to signify “unknown” state |
| Note [1] | Data in this field is combined with other data attributes in YAML-serialized format, and is stored in a single mediumtext MySQL field. This allows OpenCorporates to handle a wide variety of string lengths from multiple data sources |
| Note [2] | Data in this field is combined with other data attributes in a Ruby object, and is stored in a single longtext MySQL field. This allows OpenCorporates to handle a wide variety of string lengths from multiple data sources |
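
As an illustration of the normalisation that pipe separated and boolean fields may need during ETL, here is a minimal Python sketch. The helper names `split_pipe_field` and `parse_boolean` are hypothetical, and the sketch assumes that individual values never contain a literal `|` character, which the table above does not confirm.

```python
def split_pipe_field(value):
    """Split a pipe separated field into a list of 0, 1 or many values."""
    if value is None or value == "":
        return []  # blank field: zero instances of the underlying field
    return value.split("|")


def parse_boolean(value):
    """Map "true"/"false" to a bool and a blank field to None ("unknown")."""
    text = (value or "").strip().lower()
    if text == "":
        return None
    if text == "true":
        return True
    if text == "false":
        return False
    # Anything else is outside the documented value set, so fail loudly.
    raise ValueError(f"unexpected boolean value: {value!r}")
```

Returning `None` for a blank boolean keeps the “unknown” state distinct from `False`, which matters if the column is later loaded into a nullable database field.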