File formats for long-term digital records
The file format you choose will affect how records are preserved and managed. The choice of format becomes more critical the longer a record has to be kept.
Consider the purpose of the record, the retention period, the risk of the relevant format and how widely used and accepted the format is.
If you are digitising original physical records, you may need to determine the best file format to use for the digitised version.
Avoid high risk file formats where possible, including those:
- that are or will soon be obsolete
- that are no longer supported by the developer
- where the developer will not share information about the format
- that use 'lossy' compression techniques
- that can only be accessed or read with unsupported hardware or software
- that are restricted by intellectual property or that use digital rights management.
File formats for long-term temporary and permanent records should be:
- based on open, documented standards, particularly those developed by standards organisations
- an open or open proprietary format, as opposed to closed proprietary
- developed by a community rather than by a single vendor
- portable (can be independent of specific hardware, operating systems and software)
- commonly used (at least within a specific community of practice)
- not encumbered by intellectual property restrictions
- uncompressed, or compressed only with lossless techniques.
You can use proprietary formats for short and medium-term temporary records (e.g. 10 years) where necessary. Consider how widely used and accepted the format is, and how long it's been in use.
You may need to assess the risk of the format to determine its suitability.
Reassess the suitability of a format whenever preservation activities are carried out (e.g. format migration or refreshing storage media) to ensure it is still widely accepted and used, and that the record will remain accessible.
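The guidance above can be read as a simple screening rule. The sketch below is one possible interpretation, not an official assessment method: the retention labels, the `FormatProfile` class, the `suitability_concerns` function and the "ExampleVendor" format are all hypothetical names introduced for illustration. It flags a closed proprietary format or lossy compression only when the record is long-term temporary or permanent.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch; they are not official classification codes.
LONG_TERM = {"long-term temporary", "permanent"}
SHORT_TERM = {"short-term temporary", "medium-term temporary"}


@dataclass(frozen=True)
class FormatProfile:
    name: str
    openness: str     # "open", "open proprietary" or "closed proprietary"
    compression: str  # "none", "lossless" or "lossy"


def suitability_concerns(fmt: FormatProfile, retention: str) -> list[str]:
    """Return reasons the format may be unsuitable; an empty list means none found."""
    if retention not in LONG_TERM | SHORT_TERM:
        raise ValueError(f"unknown retention category: {retention!r}")
    concerns = []
    if retention in LONG_TERM:
        if fmt.openness == "closed proprietary":
            concerns.append("closed proprietary format used for a long-term record")
        if fmt.compression == "lossy":
            concerns.append("lossy compression used for a long-term record")
    return concerns


# Illustrative profiles: the second format is invented for the example.
odt = FormatProfile("OpenDocument Text (.odt)", "open", "lossless")
vendor_doc = FormatProfile("ExampleVendor binary document", "closed proprietary", "none")

print(suitability_concerns(odt, "permanent"))
print(suitability_concerns(vendor_doc, "permanent"))
print(suitability_concerns(vendor_doc, "short-term temporary"))
```

A rule like this can only prompt a closer look. Factors such as how widely the format is used, and how long it has been in use, still need human judgement.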
|File type||Open formats||Open proprietary formats||Closed proprietary formats|
|Word processing||OpenDocument Text (.odt)|
|Spreadsheet||OpenDocument Spreadsheet (.ods)|
|Presentation||OpenDocument Presentation (.odp)|
(including digital photographs)
|Document exchange||Portable Document Format (PDF)||Open XML Paper Specification (XPS)|
|Calendars||Internet Calendaring and Scheduling Core Object Specification (ICS)||Outlook Personal Information Store File (PST)|
|Vector graphics||AutoCAD Drawing Exchange Format (.dxf)|
|Graphics metafiles||Computer Graphics Metafile (.cgm)||Windows Enhanced Metafile (.emf)|
|Audio||MPEG-2 Audio Layer 3 (MP3)|
Note: Microsoft Office formats after 2003 and Microsoft Office Open XML formats are not supported by the Queensland Government Enterprise Architecture due to archiving and interchangeability issues. For more information, refer to the QGEA office suites policy.
Open formats
Open formats are low-risk. They:
- are supported by a wide range of software or are platform independent
- can be used and changed by anyone without restrictions (except for licensing conditions that may limit development of commercialised versions of software)
- support long-term data sustainability by allowing migration from one technical environment to another, without locking into a specific vendor
- are free to be implemented by anyone, including both proprietary and free and open source software, using the typical licences deployed by each
- are more likely to be 'future proofed' by the development community
- are usually developed through a publicly visible, community-driven process with publicly available intellectual property and format specifications.
Open proprietary formats
Open proprietary formats:
- are moderate to medium risk because they are controlled by a corporate entity under licensing arrangements that may change
- can usually only be used by licensed applications
- are developed by companies, possibly in consultation with a user community
- may share the intellectual property and technical specifications, but there may be restrictions.
Closed proprietary formats
Closed proprietary formats are medium to high risk file formats and should not be used unless necessary. They:
- carry greater risk to long-term data accessibility because the licence holders control the way the technology is used to the (current or future) exclusion of others
- can only be accessed using the software that produced that file, or licensed applications
- do not share the intellectual property or technical specifications
- are not future-proof because the specification and licensing requirements are not publicly available.
A record in a closed proprietary format may become lost or inaccessible when the format changes.
When assessing the risk of a format, consider:
- the degree of use, both worldwide and within a community of practice
- dependencies on particular hardware or software and level of interoperability
- the openness and public standardisation of format specifications, and impact of patents
- the transparency of the format, including compression and human readability
- available metadata support, including self-documentation
- whether the formats include digital rights management and the degree of technical protection and encryption involved
- the level of backwards and forwards compatibility, and compatibility with other software or hardware
- the robustness and complexity of the format
- the ability to manipulate and reuse file contents
- whether it uses 'lossy' compression (higher risk, smaller file size) or lossless compression (lower risk), or is an uncompressed file format (lowest risk, larger file size)
- whether it is an open, open proprietary or closed proprietary format.
Compression reduces the size of a file to enable efficient storage and access.
Where possible, records should not be stored in a compressed format. Lossy compression in particular can reduce quality to the point where the record becomes unusable or required information is obscured.
Compression techniques can be categorised as either lossy or lossless. If compression is needed, use lossless compression for original versions of images, video and documents as no information is irretrievably lost during the process.
Lossy compression should only be used for working versions, where the original version is to be retained. Lossy compression is not reversible and information (e.g. metadata) will be permanently lost in the process.
When using lossy compression, the resulting image or document should not appear noticeably different from the original.
If you are digitising or have digitised original paper records and are retaining the originals, the additional file size reduction that lossy compression provides can mean that a small or indistinguishable loss of data may be acceptable for some file types.
Lossy compression should not be used when digitising if the original paper records are to be destroyed as the accuracy of the image may be called into question.
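The difference between lossless and lossy compression can be shown with a small sketch. The lossless half uses DEFLATE through Python's standard `zlib` module. The lossy half is a toy illustration only: it discards the low four bits of some made-up sample values and does not represent any real image or audio codec.

```python
import zlib

original = b"Sample record content for a compression test.\n" * 50

# Lossless: decompression restores every byte of the original.
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)
lossless_ok = restored == original

# Lossy (toy illustration): keep only the top 4 of 8 bits in each sample value.
samples = [200, 201, 203, 207, 90, 91, 95]
quantised = [s >> 4 for s in samples]        # what would be stored
reconstructed = [q << 4 for q in quantised]  # the best recovery available
lossy_ok = reconstructed == samples

print(len(original), len(compressed), lossless_ok)
print(samples, reconstructed, lossy_ok)
```

The lossless round trip returns the original bytes exactly, while the quantised samples cannot be restored to their original values. This is why lossy compression is best kept to working copies where the original is retained.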
File formats can be identified using filename extensions or by using one of these free tools:
- XENA (XML Electronic Normalising for Archives) developed by the National Archives of Australia
- DROID (Digital Record Object Identification) developed by The National Archives UK
- JHOVE2 developed by the California Digital Library, Portico and Stanford University, with funding from the Library of Congress
- Metadata extraction tool developed by the National Library of New Zealand.
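A filename extension can be wrong or missing, so identification tools generally inspect the file's contents as well. The sketch below shows the basic idea with three widely known leading byte signatures. It is a minimal illustration, not a substitute for the tools listed above, and the `identify_by_signature` and `check_file` names are hypothetical.

```python
import tempfile
from pathlib import Path

# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"PK\x03\x04", "ZIP container (used by OpenDocument files, among others)"),
]


def identify_by_signature(data: bytes) -> str:
    """Return a label for the first matching signature, or "unknown"."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def check_file(path: Path) -> tuple[str, str]:
    """Return (extension, signature-based guess) so mismatches can be spotted."""
    with path.open("rb") as handle:
        head = handle.read(16)
    return path.suffix.lower(), identify_by_signature(head)


# Usage: a PDF saved with a misleading .txt extension.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.txt"
    sample.write_bytes(b"%PDF-1.7\n")
    print(check_file(sample))
```

When the extension and the signature disagree, as in the usage example, the file is worth examining with a full identification tool before deciding how to manage it.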
The PRONOM technical registry includes advice on the level of adoption and support, degree of disclosure, quality of documentation, sustainability, ease of identification, intellectual property rights, metadata support, complexity, interoperability and reusability of many formats.
See also the guide on sustainable digital file formats for creating and using records (PDF, 4.7MB), developed by the Australasian Digital Recordkeeping Initiative (ADRI) and Council of Australasian Archives and Records Authorities (CAARA).