Data Wrangling

Data in their natural state ("raw data") often contain recording errors that make them difficult to analyze. Because they are recorded by different systems and people, it is normal to end up with a file in which the same value is expressed in different ways (for example, a date may be recorded as June 28 or as 28/06), along with blank records and, of course, grammatical errors.

"Data Wrangling" is a frequently used term in different Data Science processes and is used to define the procedure that consists of extracting, transforming and mapping information.

Before these data can be analyzed, the records must be preprocessed. In other words, they must be cleaned, unified, consolidated and normalized so that valuable information can be extracted from them. This is what Data Wrangling is all about: preparing the data for use.
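As an illustration of the "unify and normalize" step, here is a minimal Python sketch that brings the two date spellings mentioned earlier (June 28 and 28/06) into one form. The function name `normalize_date`, the list of formats, and the `default_year` parameter are assumptions made for this example, since neither spelling carries a year; month names are matched in English under the default locale.

```python
from datetime import datetime

# Hypothetical list of formats seen in the raw data; extend it as needed.
# The year is appended before parsing because the raw values do not carry one.
KNOWN_FORMATS = ("%B %d %Y", "%d/%m %Y")


def normalize_date(raw, default_year):
    """Return an ISO 8601 date string, or None for blank or unrecognized values."""
    text = raw.strip()
    if not text:
        return None  # blank record
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(f"{text} {default_year}", fmt)
        except ValueError:
            continue  # try the next known format
        return parsed.date().isoformat()
    return None  # leave unknown spellings for manual review


cleaned = [normalize_date(value, 2022) for value in ["June 28", "28/06", "  "]]
```

Returning None instead of guessing keeps unrecognized values visible, so they can be reviewed by hand rather than silently altered.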

Data from files and catalogs are commonly stored as comma-separated values (CSV) files and, to a lesser extent, as tab-separated values (TSV) files.
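Both formats can be read the same way; only the delimiter changes. The sketch below uses Python's standard `csv` module. The function name `read_records` and the column names `work_name` and `object_id` are hypothetical, chosen only for this example.

```python
import csv
import io


def read_records(text, delimiter=","):
    """Parse delimited text into a list of dicts, trimming stray whitespace."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore extra cells that have no header
        }
        for row in reader
    ]


# Hypothetical sample content; the same record as CSV and as TSV.
csv_text = "work_name,object_id\nJe suis la, 2014\n"
tsv_text = "work_name\tobject_id\nJe suis la\t2014\n"

csv_records = read_records(csv_text)
tsv_records = read_records(tsv_text, delimiter="\t")
```

With a real file, the text would come from `open(path, newline="", encoding="utf-8")` instead of an in-memory string.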

For this task we mainly use three kinds of tools:

  • Data wrangling: OpenRefine
  • Spreadsheets: Google Sheets, LibreOffice, Microsoft Excel and others.
  • File renaming: Transnomino, A Better Finder Rename, KRename, Bulk Rename Utility and others.

OpenRefine

Selected resources

Google Sheets

For this manual we created a spreadsheet, the Museum Metadata Embedder dataset, containing a set of VRA Core and Dublin Core tags and data from the Bruzzone Collection, to serve as a guide. It is used not only to hold the data, but also to apply various formulas, support sheets and string translations.
Comment permissions are enabled; you are free to download it or comment on it.

File naming convention

The Bruzzone Collection catalog uses the following convention:

  1. Country: AR
  2. Institution: MA (Museos Abiertos)
  3. Collection: Bruzzone
  4. Artist: Agesta Antonella (Surname, Given name)
  5. Work Name: Je suis la
  6. Object ID: 2014
  7. Object numbering: 00
  8. Language: EN

Example: AR-MA-Bruzzone-Agesta-Antonella-Je-suis-la-2014-00-EN.jpg
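The convention above can be sketched as a small Python helper that joins the eight fields with hyphens and replaces spaces inside a field with hyphens as well. The function name `build_filename` and its parameter names are assumptions for illustration, not part of the catalog's tooling.

```python
def build_filename(country, institution, collection, artist,
                   work_name, object_id, numbering, language, extension):
    """Join the fields in order, general to specific, using hyphens only."""
    fields = [country, institution, collection, artist,
              work_name, object_id, numbering, language]
    # str.split() with no argument also collapses repeated or stray whitespace.
    stem = "-".join("-".join(str(field).split()) for field in fields)
    return f"{stem}.{extension}"


name = build_filename("AR", "MA", "Bruzzone", "Agesta Antonella",
                      "Je suis la", "2014", "00", "EN", "jpg")
```

Called with the values listed above, the helper reproduces the example file name.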

  • Start with general information (on the left) and get more specific as you move through your file name, just as you do in your folder structure. This helps your files to be ordered logically, from top to bottom.
  • Consider including a general prefix (customer, product) and/or a specific suffix (version number, color).
  • Keep your abbreviations short but meaningful, 2-3 letters if possible, as long as their meaning remains obvious.
  • Use underscores, hyphens or capitalization to aid legibility, and do not use spaces.
  • Dots should only be used to separate the file name from the format extension (e.g. logo.jpg), never in the file name itself.
  • Dates should be in ISO 8601 'year-month-day' format, so that files are sorted chronologically. Example: 2022-12-31
  • When applying file versions, use the designator "v" or "V" and a number, e.g. "v01".
  • Avoid special characters (< > | [ ] & $ + : * ? ") to make your filenames usable on the web and compatible across platforms.
  • Avoid file names that are too long. For example, the Windows API has traditionally limited a full path, file name included, to 260 characters (MAX_PATH), and common file systems limit a single file name to 255 characters.
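Several of the rules above are mechanical enough to check automatically before renaming a batch of files. The following Python sketch covers four of them (spaces, special characters, extra dots, length). The function name `filename_problems` and the 255-character limit are assumptions for this example; adjust them to the target platform.

```python
# Characters the guidelines above recommend avoiding.
FORBIDDEN_CHARACTERS = set('<>|[]&$+:*?"')
# Assumed limit for a single file name; the full-path limit is not checked here.
MAX_NAME_LENGTH = 255


def filename_problems(name):
    """Return a list of guideline violations found in a file name."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if FORBIDDEN_CHARACTERS & set(name):
        problems.append("contains special characters")
    if name.count(".") > 1:
        problems.append("uses dots inside the name")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("is too long")
    return problems


good = filename_problems("AR-MA-Bruzzone-Agesta-Antonella-Je-suis-la-2014-00-EN.jpg")
bad = filename_problems("my logo.v2.jpg")
```

An empty list means the name passed these checks; the remaining guidelines (ordering, abbreviations, date and version format) still need a human eye.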

Resources