OpenRefine

From Wikimedia Belgium

OpenRefine is a powerful tool to manipulate and prepare data to be uploaded to, or retrieved from, Wikidata.

Several extensions are available, and data can be imported from and exported to many formats.

It is more powerful than Excel, but a bit more difficult to use.

What is it?

Originally called Google Refine, OpenRefine is a tool for automated manipulation of lists of data organised in rows and columns. It can also serve as a front end to Wikidata.

Install

Download the zip file and unzip it. If Java is not yet available on your computer, also install the JRE (Java Runtime Environment).

Run

On Windows, run openrefine.exe from File Explorer. A command window opens, and OpenRefine then opens in your web browser.

Stop

Save your data, then close the command window (the "DOS window") that opened when you started OpenRefine.

Functionality

Presentation OpenRefine
  • Data import (multiple sources)
    • Input from CSV, Excel, Google Spreadsheet, XML file, or the clipboard (very practical)
    • Skip empty rows
  • Selectively delete rows based on facets (filter query)
  • Cleanup: merge data, detect and correct outliers
  • Transform: split strings into columns or concatenate columns, using GREL (General Refine Expression Language)
  • Reconcile: validate and get ID data from e.g. Wikidata
    • Choose an instance, or reconcile against no particular type
    • Verify if you got the right item (check homonyms based upon descriptions or statements)
    • Choose the right homonym
    • Flag the item for creation if it does not exist yet
  • Create columns based on Wikidata statements from reconciled items
  • Enrich: get additional data by ID from external databases
  • Verify the data quality
  • Create Wikidata schemas (prepare the data upload: item labels, descriptions, aliases, statements)
  • Upload to Wikidata
    • Pay attention not to create duplicate items or statements
      • create prerequisite items
      • first amend existing items
        • create a minimum list of statements
      • then create new items; multiple targets are possible (use multiple facets when required)
        • add all required statements => no risk of duplicate statements, since the items are newly created
  • Extensions
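
The upload order described above (first amend existing items, then create new ones) can be illustrated outside OpenRefine. The following is only a minimal Python sketch, not part of OpenRefine: the Row class, the function names and the placeholder IDs are all hypothetical. It separates reconciled rows from rows flagged for creation and reports labels that would produce duplicate new items.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    """One row of a project (hypothetical structure, for illustration only)."""
    label: str
    qid: Optional[str]  # reconciled item ID, or None when flagged for creation


def split_for_upload(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (rows matching existing items, rows that need a new item)."""
    existing = [r for r in rows if r.qid is not None]
    new = [r for r in rows if r.qid is None]
    return existing, new


def duplicate_labels(rows: List[Row]) -> List[str]:
    """Return normalised labels that occur more than once in the given rows."""
    seen, dupes = set(), set()
    for r in rows:
        key = r.label.strip().lower()
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return sorted(dupes)


# Placeholder data: the ID below is made up and is not a real Wikidata item.
rows = [
    Row("Atomium", "Q_EXISTING_1"),
    Row("Example Museum", None),
    Row("example museum ", None),
]
existing, new = split_for_upload(rows)
print(len(existing), "to amend;", len(new), "to create")
print("possible duplicates among new items:", duplicate_labels(new))
```

Inside OpenRefine the same separation is normally done with facets on the reconciliation status; the sketch only makes the reasoning explicit.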

Menu commands

  • Reconcile against no type
  • Convert date values to text before uploading them
  • Trim leading and trailing spaces
  • Collapse repeated white space
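
To make the effect of the two whitespace commands concrete, here is a small standalone Python sketch. It is an approximation of what the menu commands do, written for illustration; the function names are hypothetical and the code is not taken from OpenRefine.

```python
import re


def trim(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value: str) -> str:
    """Replace every run of whitespace (spaces, tabs) with a single space."""
    return re.sub(r"\s+", " ", value)


raw = "  Royal   Museums \t of Fine Arts  "
cleaned = trim(collapse_whitespace(raw))
print(repr(cleaned))
```

Applying both steps matters for reconciliation: two cells that differ only in stray spaces would otherwise be treated as different values.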

GREL functions

  • Replace special characters:
value.replace(".0", '0')
value.replace("”", '"')
value.replace("“", '"')
value.replace("’", "'")
  • Extract the first sentence from a text (to avoid plagiarism); match() returns an array of capture groups, so take the first element:
value.match(/([A-Za-z0-9éë ,:"'()-]+)[.].*/)[0]
  • Make first character lowercase
toLowercase(value.substring(0,1))+value.substring(1)
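
The expressions above can be tried out on sample strings before applying them to a whole column. The sketch below mirrors three of them in standalone Python 3. OpenRefine's expression editor also accepts Python (Jython), but this code is meant to be run outside OpenRefine and has not been checked under Jython; the function names are hypothetical.

```python
import re

# Typographic quotes mapped to their plain ASCII equivalents.
QUOTE_MAP = {"\u201d": '"', "\u201c": '"', "\u2019": "'"}

# Same pattern as the GREL example: allowed characters, a full stop, then the rest.
FIRST_SENTENCE = re.compile(r"""([A-Za-z0-9éë ,:"'()-]+)[.].*""")


def straighten_quotes(value: str) -> str:
    """Replace curly quotes with straight ones."""
    for curly, plain in QUOTE_MAP.items():
        value = value.replace(curly, plain)
    return value


def first_sentence(value: str):
    """Return the text before the first full stop, or None when nothing matches."""
    m = FIRST_SENTENCE.fullmatch(value)
    return m.group(1) if m else None


def lowercase_first(value: str) -> str:
    """Lowercase only the first character."""
    return value[:1].lower() + value[1:]


print(straighten_quotes("\u201cAtomium\u201d"))
print(first_sentence("Museum in Brussels. Founded in 1801."))
print(lowercase_first("Museum in Ghent"))
```

Note that the pattern only accepts the listed characters, so a first sentence containing, for example, a semicolon or an accented letter other than é or ë does not match and the function returns None.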

Options

  • You can set the user-interface language through the userLang key in Preferences

Unresolved

  • How to add rows?

Use case

Known problems

  • Uploading to Wikidata can take a long time, and no "transaction in progress" message is shown while it runs
  • Duplicate items can be merged afterwards (first check that they really are duplicates; e.g. a museum and the building that houses it are distinct items)

Documentation

See also