Jump to content

OpenRefine

From Wikimedia Belgium

OpenRefine is a powerful ETL data manipulation tool to validate, prepare data to be uploaded to, upload, verify, or retrieve from Wikidata, and Wikimedia Commons.

  • For Wikidata, it allows to reconcile and upload statements, based on rows and data columns corresponding to properties.
  • For Wikimedia Commons, it allows to upload pictures, and register media metadata.

Multiple plugins exist. You can import and export data from and to multiple formats.

It is much more powerful and flexible than Excel, because it interfaces with external systems, but a bit more difficult to use. It is using open standards.

It has a read and write interface to Wikidata, and Wikimedia Commons (SDC). It can export to QuickStatements for large batch operations, or you can choose to amend Wikidata interactively.

What is it?

[edit]

Originally called Google Refine, an automated tool to manipulate lists of data, based on rows and columns, developed by Google. It is a front-end to Wikidata and Wikimedia Commons.

It allows to import and export to other formats like Excel, CSV, Google Spreadsheets, etc.

Projects can be exported and imported, so that they can be used on multiple devices.

Install the tool

[edit]

Read first:

Platforms

[edit]
  • Windows: 2 methods:
    • Download the install kit. Start from the menu-bar.
    • Download the zip file. Unzip in a subdirectory. Possibly install the Java Run-time Environment (JRE). Start from Explorer.
sudo apt install default-jre

Runtime environment

[edit]

Run

[edit]
  • On Windows: Start a localhost webserver, the workstation will automatically open a browser.
    • Click on OpenRefine on the desktop
    • Run openrefine.exe from Explorer
  • On Linux or Chromebook:
    • Run the shell script "refine" (backend)
    • A localhost webserver will be started
    • Open the application in the browser: http://127.0.0.1:3333 (frontend)

Storage

[edit]

Local storage physically on the client device. The location depends on the platform.

Transfer

[edit]

You can transfer a project from one platform to another (project export/import).

Stop

[edit]

Save your data. Close the command window, or exit the application (^c).

Functionality

[edit]
Presentation OpenRefine
  • Data import (multiple targets)
    • Input from CSV, Excel, Google Spreadsheet, XML file, paste buffer (very practical)
    • Skip empty rows
  • You can undo all of the following actions.
  • Selectively delete rows based on facets (filter query)
  • Add, delete, rename, or move columns.
  • Cleanup: merge data, detect and correct outliers
  • Transform: split strings into columns, concatenate columns, based on GREL = General Refine Expression Language
  • Reconcile: validate and get ID data from e.g. Wikidata
    • Choose an instance, or reconcile against no particular type
    • Verify if you got the right item (check homonyms based upon descriptions or statements)
    • Choose the right homonym
    • Flag the item for creation if it does not exist yet
  • Create columns based on Wikidata statements from reconciled items
  • Enrich: get additional data by ID from external databases
  • Verify the data quality
  • Create Wikidata schemas (prepare data upload: item labels, descriptions, alias, statements)
  • Upload to Wikidata
    • Pay attention not to create duplicate items or statements
      • create prerequisite items
      • first amend existing items
        • create a minimum list of statements
      • then create new items; multiple targets are possible (use multiple facets when required)
        • add all required statements => no risk for duplicate statements since new items are created
  • Upload images and photo metadata to Wikimedia Commons; add P180 Depicts statements to backlink to Wikidata items.
  • Extensions
[edit]
  • Reconcile; the system will propose a default property (based on instance occurrence).
  • Reconcile against no type (e.g. when inputting Q-numbers)
  • Convert to text, before uploading a date
  • Special string format for latitude, longitude, e.g. Belgische militaire begraafplaats van Hoogstade: 50.98740, 2.68670 (5 decimals have an accuracy of 1 m => 1° ~ 111 km)
  • White space (should be handled by Wikibase, but it is not?) -- perform manual after importing the sheet for all columns:
    • Trim leading and trailing spaces
    • Collapse repeated white space

GREL functions

[edit]

GREL = General Refine Expression Language:

Replace special characters

[edit]
value.replace(".0" ,'0')
value.replace("," ,'.')
value.replace("”" ,'"')
value.replace("“" ,'"')
value.replace("’" ,"'")
Extract the first sentence from a text: (avoid plagiaat)
[edit]
value.match(/([A-Za-z0-9éë ,:"'()-]+)[.].*/).toString()
Make first character lowercase
[edit]
toLowercase(value.substring(0,1))+value.substring(1)
Conditional expression; concatenate city and locality
[edit]
"brouwerij in " + value + if(value==cells.Localiteit.value, "", " (" + cells.Localiteit.value + ")") +", België"

Facets and filters

[edit]

Using facets and filters you can limit the number of rows to be updated, or uploaded.

Schemas

[edit]

You need to build a schema to prepare the upload.

Error check

[edit]

You need to check the errors, and correct problems with missing data.

Preview

[edit]

Quick review of the updates that are pending.

Upload

[edit]

You can choose between:

Options

[edit]
  • You can set the user interface language via Preferences userLang

Use cases

[edit]

Known problems

[edit]
  1. Uploading to Wikidata can take a long time; you are not notified via a message "transaction in progress"
  2. Upload fails
  3. Upload runs, but nothing is saved; probably technical error, see column "Wikibase editing results"

Duplicate items

[edit]
  • Avoid creating duplicate items, aliases, statements, and values:
    • Reconciling might fail (slight different label, wrong instance, homonyms)
    • Manual intervention might be required (manual search via the GUI)
    • Reconcile in multiple cycles (filter already reconciled items)
    • Run validate_item.py to get matching Q-numbers
  • You could merge any duplicate items afterwards (please be careful that it is really duplicate; e.g. museum exhibition ion against the building with the same name)

Manual corrections

[edit]

You need to be very carefull when performing uploads.

  • Be prepared to make manual corrections
  • Possibly you can use Pywikibot scripts to make mass corrections
  • You might need to rollback transactions

Spaces

[edit]
  • Trim spaces, and collapse repeated space characters for all columns before starting amending the data.
    • Items with redundant spaces will not reconcile !

Documentation

[edit]

Video

[edit]

Wikidata

[edit]

WikiLearn

[edit]

There is an online course available:

Source

[edit]

Software development, not for users.

OpenRefine helpdesks

[edit]

Sessions and projects

[edit]

See also

[edit]

Wikimedia