OpenRefine


OpenRefine is a powerful ETL (extract, transform, load) tool to validate and prepare data, and to upload it to or retrieve it from Wikidata and Wikimedia Commons.

Multiple plugins exist. You can import data from and export it to multiple formats.

It is more powerful than Excel because it interfaces with external systems, although it is a bit more difficult to use. It uses open standards.

It has a read and write interface to Wikidata and to Wikimedia Commons (Structured Data on Commons, SDC). It can export to QuickStatements for large batch operations, or you can amend Wikidata interactively.

What is it?


Originally called Google Refine, it is an automated tool to manipulate lists of data organised in rows and columns. It also acts as a front-end to Wikidata.

It can import from and export to other formats like Excel, CSV, Google Sheets, etc.

Projects can be exported and imported.

Install the tool


Read first:

Platforms

  • Windows: Download the zip file. Unzip it. If needed, install the Java Runtime Environment (JRE).
  • Linux: Install the JRE if it is not already present:
sudo apt install default-jre

Runtime environment


Run

  • On Windows: Run openrefine.exe from Explorer, which will automatically open a browser (localhost webserver).
  • On Linux:
    • Run the shell script "refine" (backend)
    • Open the application in the browser: http://127.0.0.1:3333 (frontend)
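
If you want to verify from a script that the backend is up, a minimal Python sketch could look like this (it assumes the default address 127.0.0.1:3333; adjust if you configured another port):

from urllib.error import URLError
from urllib.request import urlopen

# Probe the local OpenRefine webserver and report whether it responds.
try:
    with urlopen("http://127.0.0.1:3333", timeout=5) as response:
        print("OpenRefine is running, HTTP status:", response.status)
except URLError as exc:
    print("OpenRefine does not seem to be running:", exc)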

Storage


Projects are stored locally on the client device. The storage location depends on the platform.

Transfer


You can transfer a project from one platform to another.

Stop


Save your data first. Then close the command window, or exit the application (Ctrl+C).

Functionality

Presentation OpenRefine
  • Data import (multiple sources)
    • Input from CSV, Excel, Google Spreadsheet, XML file, paste buffer (very practical)
    • Skip empty rows
  • Selectively delete rows based on facets (filter query)
  • Cleanup: merge data, detect and correct outliers
  • Transform: split strings into columns, concatenate columns, based on GREL = General Refine Expression Language
  • Reconcile: validate and get ID data from e.g. Wikidata (a sketch of the underlying API call follows below this list)
    • Choose an instance, or reconcile against no particular type
    • Verify if you got the right item (check homonyms based upon descriptions or statements)
    • Choose the right homonym
    • Flag the item for creation if it does not exist yet
  • Create columns based on Wikidata statements from reconciled items
  • Enrich: get additional data by ID from external databases
  • Verify the data quality
  • Create Wikidata schemas (prepare the data upload: item labels, descriptions, aliases, statements)
  • Upload to Wikidata
    • Pay attention not to create duplicate items or statements
      • create prerequisite items
      • first amend existing items
        • create a minimum list of statements
      • then create new items; multiple targets are possible (use multiple facets when required)
        • add all required statements => no risk of duplicate statements, since the items are newly created
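
For those curious what the Reconcile step does behind the scenes, here is a minimal Python sketch against the public Wikidata reconciliation service. The endpoint URL below is the default service at the time of writing and may change; the query values are examples.

import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata reconciliation service used by OpenRefine.
ENDPOINT = "https://wikidata.reconci.link/en/api"

# One reconciliation query: look up a label, restricted to type Q5 (human).
queries = {"q0": {"query": "Douglas Adams", "type": "Q5"}}
data = urllib.parse.urlencode({"queries": json.dumps(queries)}).encode()

with urllib.request.urlopen(ENDPOINT, data=data) as response:
    result = json.load(response)

for candidate in result["q0"]["result"]:
    # Each candidate carries a Q-number, a label, a score, and a match flag;
    # this is the information OpenRefine shows in a reconciliation cell.
    print(candidate["id"], candidate["name"], candidate["score"], candidate["match"])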

Extensions

  • Reconcile against no type (when inputting Q-numbers)
  • Convert to text before uploading a date
  • White space (should be handled by Wikibase?)
    • Trim leading and trailing spaces
    • Collapse repeated white space

GREL functions


GREL = General Refine Expression Language:

  • Replace special characters:
value.replace(".0" ,'0')
value.replace("”" ,'"')
value.replace("“" ,'"')
value.replace("’" ,"'")
  • Extract the first sentence from a text (to avoid plagiarism):
value.match(/([A-Za-z0-9éë ,:"'()-]+)[.].*/)[0]
  • Make first character lowercase
toLowercase(value.substring(0,1))+value.substring(1)
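
For testing these expressions outside OpenRefine, here are roughly equivalent transformations in plain Python (the regular expression mirrors the GREL pattern above; the sample strings are only illustrations):

import re

def clean_quotes(value):
    # Normalise typographic quotes, as in the GREL replace examples above.
    return value.replace("”", '"').replace("“", '"').replace("’", "'")

def first_sentence(value):
    # Same idea as the GREL match above: keep the text before the first full stop.
    m = re.match(r"""([A-Za-z0-9éë ,:"'()-]+)\.""", value)
    return m.group(1) if m else value

def lower_first(value):
    # Make the first character lowercase, as in the GREL example above.
    return value[:1].lower() + value[1:]

print(clean_quotes("“Hello”"))                             # "Hello"
print(first_sentence("OpenRefine is a tool. It cleans."))  # OpenRefine is a tool
print(lower_first("OpenRefine"))                           # openRefine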

Options

  • You can set the user interface language via the userLang preference on the Preferences screen

Unresolved

  • How to add rows? In theory this is possible, but it is very complicated and technical.
  • The best workaround is to add some redundant rows up front, or to export the data, append the rows externally, and re-import the sheet (see the sketch below).
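
A minimal Python sketch of the export-and-extend workaround, assuming the project was exported to a hypothetical file project.csv:

import csv

# Hypothetical file name: first export your project from OpenRefine as project.csv.
with open("project.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Append the extra rows; the column order must match the exported header.
    writer.writerow(["New label", "New description", "Q12345"])
# Then re-import project.csv into OpenRefine as a new project.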

Use case


Known problems

  1. Uploading to Wikidata can take a long time, and there is no "transaction in progress" message to tell you it is still busy
  2. The upload fails
  3. The upload runs, but nothing is saved

In each of these cases, the workaround is to export to QuickStatements instead (which can be error-prone and may involve more manual work); see the sketch below.
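
QuickStatements commands can also be generated outside OpenRefine. Here is a minimal Python sketch producing QuickStatements V1 commands from a CSV; the file name, the column names, and the use of P17 are hypothetical examples:

import csv

# Minimal sketch: turn a CSV with hypothetical columns "label" and "country_qid"
# into QuickStatements V1 commands. CREATE starts a new item, LAST refers to it,
# Len sets the English label, and P17 ("country") is just an example property.
with open("items.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print("CREATE")
        print("LAST\tLen\t\"%s\"" % row["label"])
        print("LAST\tP17\t%s" % row["country_qid"])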

Duplicate items

  • Avoid creating duplicate items, aliases, statements, and values:
    • Reconciling might fail (slightly different label, wrong instance, homonyms)
    • Manual intervention might be required (manual search via the GUI)
    • Reconcile in multiple cycles (filter out already reconciled items)
    • Run validate_item.py to get matching Q-numbers (a generic pre-check sketch follows below)
  • You can merge any duplicate items afterwards (please verify that they are really duplicates; e.g. a museum exhibition versus the building with the same name)
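
A minimal Python sketch of such a pre-check (not the validate_item.py script itself), using the standard wbsearchentities API module to look up existing items by label; the sample label is only an illustration:

import json
import urllib.parse
import urllib.request

def search_items(label, language="en"):
    # wbsearchentities is a standard Wikidata API module; it returns candidate
    # items whose label or alias matches the search string.
    params = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + params
    with urllib.request.urlopen(url) as response:
        return json.load(response)["search"]

for hit in search_items("Atomium"):
    # Inspect the description to tell homonyms apart before creating a new item.
    print(hit["id"], hit.get("description", ""))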

Manual corrections


You need to be very careful when performing uploads.

  • Be prepared to make manual corrections
  • Possibly you can use Pywikibot scripts to make mass corrections (see the sketch after this list)
  • You might need to rollback transactions
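
A minimal Pywikibot sketch for such a mass correction; the property (P31), the wrong value, and the list of affected items are all hypothetical placeholders:

import pywikibot

# Minimal sketch, assuming Pywikibot is installed and configured for Wikidata.
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

WRONG_VALUE = "Q12345"             # hypothetical Q-number of the wrong target
AFFECTED = ["Q100001", "Q100002"]  # hypothetical list of items to correct

for qid in AFFECTED:
    item = pywikibot.ItemPage(repo, qid)
    item.get()
    for claim in item.claims.get("P31", []):
        target = claim.getTarget()
        if target and target.id == WRONG_VALUE:
            # Remove the wrongly added statement, with an edit summary.
            item.removeClaims([claim], summary="Remove wrongly added statement")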

Documentation


Wikidata


WikiLearn


There is an online course available:

Source


OpenRefine helpdesks


Sessions and projects


See also


Wikimedia