OpenConvert help

The OpenConvert tools convert a number of input formats to TEI. In addition, as a proof of concept, the website currently provides two annotation tools: a simple tokenizer for TEI files and a part-of-speech tagger for modern Dutch.

Using the website

You can use the tool website to submit files, text or a URL to the annotation tools. To do so, set the following parameters in the HTML form:
  • The input format: This option is only available for file uploads. Supported formats are TEI, plain text, HTML, Alto (experimental), and Microsoft Word 97 or 2010 (.doc, .docx). Please keep in mind that conversion from semi-structured text to TEI is not a very clearly defined task, and the results may be unsatisfactory in many cases.
  • The output format: TEI or (experimentally) FoLiA.
  • Output type ("show output as"): The tool service can provide the result as a link to the tagged file, or output the resulting TEI structure immediately. The options are:
    • Styled: formatted for on-screen reading and inspection of the linguistic annotation added by the selected tool.
    • Prettyprinted XML: indented and colored XML source rendering
    • Link: link to the tagged XML file. Keep in mind that the INL does not guarantee persistence of the linked result files for more than a couple of hours.
    • Raw: output of the tagged XML file (your browser will render it in some way). Use this, or the previous option, if you want to use the result for further automatic processing.
  • Linguistic annotation: choice of annotation tool (tokenizer or tagger/lemmatizer)

Using the OpenConvert command line and web service

Using the web service

The tool service can be called as a REST webservice which returns responses in XML, allowing it to be part of a webservice tool chain.

The service accepts multipart form data input and writes the output directly in the response.

Relevant URL parameters:

Parameter name: description. Possible values
  • tagger: Linguistic annotation tool name. Possible values:
    • chn-tagger: Basic tagger-lemmatizer for modern Dutch
    • tokenizer: A TEI tokenizer
  • format: Format of input file. Possible values: tei, html, alto, word, docx, epub, text
  • to: Format of output file. Possible values: tei, folia
  • input: Input file (file upload). Possible values: name of any file on your computer
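As an illustration of the parameters above, the following Python sketch builds a multipart form request that uploads a file in the input field and passes tagger, format and to as URL parameters. It uses only the Python standard library. The endpoint in SERVER_URL is a placeholder, since the actual service address is not stated here, and the function names build_request and convert are hypothetical helpers, not part of OpenConvert.

```python
import os
import uuid
import urllib.parse
import urllib.request

# Placeholder endpoint (assumption): replace with the real service URL.
SERVER_URL = "http://localhost:8080/openconvert/convert"


def build_request(server_url, filename, content, tagger, fmt, to):
    """Build a multipart POST request with the file in the 'input' field.

    tagger, format and to are sent as URL parameters, as listed above.
    """
    query = urllib.parse.urlencode({"tagger": tagger, "format": fmt, "to": to})
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        disposition.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(
        f"{server_url}?{query}", data=body, headers=headers, method="POST"
    )


def convert(path, tagger="tokenizer", fmt="tei", to="tei", server_url=SERVER_URL):
    """Upload the file at 'path' and return the response body as bytes."""
    with open(path, "rb") as handle:
        request = build_request(
            server_url, os.path.basename(path), handle.read(), tagger, fmt, to
        )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling convert("document.xml", tagger="chn-tagger") would then return the annotated XML, assuming the service is reachable at the configured URL and accepts the parameters in this form.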

The service can, for instance, be accessed with a simple command line client Java program (openconvert.client.jar), which we provide as an example. It uploads a file to the service and writes the servlet response to standard output.

The usage for this client is:

usage: java nl.inl.openconvert.OpenConvertClient <file to be tagged>
Options:
-f, --from <arg>        input format
-t, --to <arg>          output format
-s, --serverURL <arg>   location of tagging service
-a, --annotation <arg>  annotation tool (chn-tagger or tokenizer)

Using the command line

The OpenConvert distribution is available at https://github.com/INL/OpenConvert.

The command line can be used as follows:

java -jar OpenConvert.jar -from <input format> -to <output format> <input> <output>
Options:
-from   input format: text, TEI, alto, doc, docx, HTML
-to     output format: TEI, text or folia
Arguments:
input   filename, directory name or zip archive name (ending with .zip)
output  filename, directory name or zip archive name (ending with .zip)
If the -from and -to flags are omitted, the conversion to apply is guessed from the file name extensions.
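For scripted or batch use, the command line above can be wrapped in a small helper. The Python sketch below assembles the argument list and leaves out -from and -to when no format is given, so that the tool falls back on guessing from the file name extensions. The function names build_command and run_conversion are hypothetical, and the sketch assumes that OpenConvert.jar is in the current directory and that java is on the PATH.

```python
import subprocess


def build_command(input_path, output_path, from_format=None, to_format=None,
                  jar="OpenConvert.jar"):
    """Assemble the OpenConvert command line as an argument list.

    A format flag is left out when its value is None.
    """
    command = ["java", "-jar", jar]
    if from_format is not None:
        command += ["-from", from_format]
    if to_format is not None:
        command += ["-to", to_format]
    return command + [input_path, output_path]


def run_conversion(input_path, output_path, from_format=None, to_format=None):
    """Run the conversion and raise CalledProcessError if it fails."""
    subprocess.run(
        build_command(input_path, output_path, from_format, to_format),
        check=True,
    )
```

For example, run_conversion("report.docx", "report.xml", "docx", "TEI") corresponds to java -jar OpenConvert.jar -from docx -to TEI report.docx report.xml; the file names are illustrative only.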