Additional Notes:
File / Folder naming | – Each delivery will be in a folder named after its “as-at” date, e.g. “2016-06-01” – Sample / test files will be in separate dated folders, e.g. “sample_2016_06_01” |
Compression | – Files will be compressed in gzip format, with “.gz” extension |
md5sum | – A file named “md5sum” will be added showing an MD5 checksum for all gzipped files – this helps check for any integrity issues during download |
File transfer mechanism | – SFTP – User authentication: choice of password or private-key certificate |
Format of files | – CSV, conformant to RFC 4180 https://tools.ietf.org/html/rfc4180 – Line breaks in text are indicated with \n (common in address fields) – Unix (LF) line endings, rather than the CRLF that RFC 4180 specifies |
Column separator | , (comma) |
Fields enclosed by | Optionally enclosed by double quotes, e.g. when a field contains a comma: xxx,"Test Company, LLC",yyy |
Escape character | Double quotes in text are escaped with double quotes as per RFC 4180, e.g. xxx,"test ""hello"" company llc",yyy |
Encoding | UTF-8 |
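
The delivery conventions above (gzip compression, an md5sum file, comma-separated UTF-8 CSV with double-quote escaping) could be consumed along the following lines. This is a minimal Python sketch under stated assumptions, not OpenCorporates' own tooling: it assumes the md5sum file uses the usual `md5sum` layout of one "digest  filename" pair per line, which the notes above do not confirm. The function names and the file name `companies.csv.gz` are hypothetical.

```python
import csv
import gzip
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse 'digest  filename' lines (assumed layout) into {filename: digest}."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # The md5sum tool prefixes binary-mode file names with '*'.
            checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def iter_rows(stream):
    """Yield CSV rows from a gzipped, UTF-8 encoded binary stream."""
    # newline="" leaves line-break handling inside quoted fields to csv.
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.reader(text, delimiter=",", quotechar='"', doublequote=True)


# Illustrative data only, built from the two quoting examples in the notes.
sample = b'xxx,"Test Company, LLC",yyy\nxxx,"test ""hello"" company llc",yyy\n'
archive = gzip.compress(sample)

# Stand-in for the contents of the delivered md5sum file (hypothetical name).
listing = md5_of_stream(io.BytesIO(archive)) + "  companies.csv.gz\n"
checksums = parse_md5sum(listing)
is_intact = checksums["companies.csv.gz"] == md5_of_stream(io.BytesIO(archive))

rows = list(iter_rows(io.BytesIO(archive)))
```

With real deliveries, the streams would be files opened in binary mode, e.g. `open(path, "rb")`, so that large archives are hashed and parsed without being loaded into memory at once.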
Data Types
varchar(255) | string up to 255 chars in length |
text | string up to 65,535 bytes in length – NB: UTF8 consumes 1-3 bytes per character |
mediumtext | string up to 16 MB in length – NB: UTF8 consumes 1-3 bytes per character |
longtext | string up to 4 GB in length – NB: UTF8 consumes 1-3 bytes per character |
pipe separated fields | Field length cannot be determined. Data in this field can consist of 0, 1 or multiple instances of an underlying field, so the data may need to be normalised during ETL |
boolean | contains true or false, or is left blank to signify an “unknown” state |
Note [1] | Data in this field is combined with other data attributes in YAML-serialized format, and is stored in a single mediumtext MySQL field. This allows OpenCorporates to handle a wide variety of string lengths from multiple data sources |
Note [2] | Data in this field is combined with other data attributes in a Ruby object, and is stored in a single longtext MySQL field. This allows OpenCorporates to handle a wide variety of string lengths from multiple data sources |
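
The pipe separated and boolean data types described above usually need a small conversion step during ETL. The sketch below shows one possible approach in Python. The helper names are hypothetical, and it assumes that individual values never contain a literal "|" character, which this document does not state.

```python
def split_pipe_field(value):
    """Split a pipe separated field into a list of 0, 1 or many values."""
    # A blank field means zero instances of the underlying field.
    if value == "":
        return []
    return value.split("|")


def parse_boolean(value):
    """Map 'true' / 'false' / blank to True / False / None (unknown)."""
    normalised = value.strip().lower()
    if normalised == "true":
        return True
    if normalised == "false":
        return False
    if normalised == "":
        return None  # blank signifies the "unknown" state
    raise ValueError(f"unexpected boolean value: {value!r}")


# Illustrative values only.
no_values = split_pipe_field("")
one_value = split_pipe_field("a")
many_values = split_pipe_field("a|b|c")
unknown = parse_boolean("")
```

Returning a list from `split_pipe_field` makes it straightforward to write one row per value into a child table when normalising the data.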