Data in its natural state ("raw data") often contains recording errors that make it difficult to analyze. Because it is recorded by different systems and people, it is common to end up with a file in which the same value is expressed in different ways (for example, a date may be recorded as June 28 or as 28/06), along with blank records and, of course, spelling errors.
"Data Wrangling" is a frequently used term in different Data Science processes and is used to define the procedure that consists of extracting, transforming and mapping information.
Before these data can be analyzed, all the records must be preprocessed. In other words, they must be cleaned, unified, consolidated, and normalized so that valuable information can be extracted from them. This is what Data Wrangling is all about: preparing the data for use.
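To make this concrete, here is a minimal Python sketch (an illustration, not part of the manual's toolchain) that normalizes a column of dates recorded in different formats into a single ISO format and flags blank records. The sample values and candidate formats are assumptions:

```python
from datetime import datetime

# Hypothetical raw values: the same date recorded in different ways,
# plus a blank record.
raw_dates = ["June 28, 2014", "28/06/2014", "2014-06-28", ""]

# Candidate formats we expect to encounter in the raw data.
FORMATS = ["%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]

def normalize(value: str) -> str | None:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None for blank or unparsable values."""
    value = value.strip()
    if not value:
        return None  # blank record: flag it for review
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format: also flag for review

print([normalize(v) for v in raw_dates])
# ['2014-06-28', '2014-06-28', '2014-06-28', None]
```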
Data from files and catalogs are commonly stored as comma-separated values (CSV) files and, to a lesser extent, as tab-separated values (TSV) files.
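Reading either format in Python differs only in the delimiter. This is a minimal sketch, assuming files named `catalog.csv` and `catalog.tsv` exist in the working directory:

```python
import csv

# CSV: the default delimiter is a comma.
with open("catalog.csv", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))

# TSV: the same reader, with the delimiter set to a tab.
with open("catalog.tsv", newline="", encoding="utf-8") as f:
    tsv_rows = list(csv.DictReader(f, delimiter="\t"))

print(csv_rows[0])  # first record as a {column: value} dictionary
```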
For this task we mainly use three kinds of tools:
| Task | Tools |
| --- | --- |
| Data Wrangling | OpenRefine |
| Spreadsheet | Google Sheets, LibreOffice, Microsoft Excel, and others |
| File renaming | Transnomio, A Better Finder Rename, KRename, Bulk Rename Utility, and others |
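If you prefer scripting to the renaming utilities listed above, the same kind of batch rename can be done with a short Python sketch. The folder name and the rename rule (replacing spaces with hyphens, in the spirit of the catalog convention) are assumed examples only:

```python
from pathlib import Path

# Hypothetical folder of images to rename; adjust to your own path.
folder = Path("images")

for path in folder.glob("*.jpg"):
    # Example rule: replace spaces with hyphens.
    new_name = path.name.replace(" ", "-")
    if new_name != path.name:
        path.rename(path.with_name(new_name))
        print(f"{path.name} -> {new_name}")
```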
For this manual we created a spreadsheet with a set of VRA Core and Dublin Core tags and data from the Bruzzone Collection that serves as a guide: the Museum Metadata Embedder dataset. It is used not only to hold the data, but also to apply various formulas, supporting sheets, and string translation.
Comment permissions are enabled; you are free to download it or comment on it.
The Bruzzone Collection catalog uses the following file-naming convention:
Example: `AR-MA-Bruzzone-Agesta-Antonella-Je-suis-la-2014-00-EN.jpg`
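As a rough illustration, a Python sketch can split the example filename into its hyphen-separated parts. The field labels below are assumptions inferred from this single example (country, region, and collection at the start; year, sequence number, and language at the end), not a specification from the catalog itself; note that the title may contain hyphens, which makes the middle portion ambiguous:

```python
filename = "AR-MA-Bruzzone-Agesta-Antonella-Je-suis-la-2014-00-EN.jpg"

stem = filename.rsplit(".", 1)[0]   # drop the extension
parts = stem.split("-")

# Assumed field positions: the first three and the last three parts
# look fixed; everything in between covers the artist and the title.
country, region, collection = parts[:3]
year, number, language = parts[-3:]
middle = parts[3:-3]                # ['Agesta', 'Antonella', 'Je', 'suis', 'la']

print(country, region, collection)  # AR MA Bruzzone
print(year, number, language)       # 2014 00 EN
print(middle)
```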