emdros2laf 4.8.3¶

Submodules¶

emdros2laf.settings module¶

class Settings[source]¶

Bases: object

Stores configuration information from the main configuration file and the command line.

Defines an extra function in order to get the items in a section as a dictionary, without getting the DEFAULT items as well.
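A minimal sketch of such a helper, assuming the configuration is read with Python's standard configparser (the source does not confirm this). The class name SketchSettings, the method name section_dict and the sample keys are hypothetical.

```python
import configparser


class SketchSettings(configparser.ConfigParser):
    """Hypothetical stand-in; not the actual Settings class."""

    def section_dict(self, section):
        # items(section) also returns the DEFAULT entries, so filter them out.
        # Limitation of this sketch: a key that overrides a default inside
        # the section is dropped as well.
        defaults = self.defaults()
        return {key: value for key, value in self.items(section)
                if key not in defaults}


cfg = SketchSettings()
cfg.read_string("""
[DEFAULT]
root = /data

[laf_templates]
header = header.xml
""")

section = cfg.section_dict("laf_templates")
print(section)
```

The plain items() call on the same section would also have returned the root entry inherited from DEFAULT.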

annotation_skip = {'self'}¶

flag(name)[source]¶

laf_switches = {'comment_local_deps'}¶

emdros2laf.etcbc module¶

class Etcbc(settings)[source]¶

Bases: object

Knows the ETCBC data format.

All ETCBC knowledge is stored in a file that describes objects, features and values. These are many items, so we divide them into parts and subparts. We have parts for monads, sections and linguistic objects. The generated LAF files may become unwieldy in size, which is why we also divide parts into subparts. Parts correspond to sets of objects and their features. Subparts correspond to subsets of objects and/or subsets of features. N.B. It is “either or”: either

a part consists of only one object type, and the subparts divide the features of that object type

or

a part consists of multiple object types, and the subparts divide the object types of that part. If an object type belongs to a subpart, all its features belong to that subpart too.

In our case, the part ‘monad’ has the single object type word, and its features are divided over subparts. The part ‘lingo’ has object types sentence, sentence_atom, clause, clause_atom, phrase, phrase_atom, subphrase, word. Its subparts are a partition of these object types into several subsets. The part ‘section’ does not have subparts. Note that an object type may occur in multiple parts: consider ‘word’. However, ‘word’ in part ‘monad’ has all non-relational word features, whereas ‘word’ in part ‘lingo’ has only relational features, i.e. features that relate words to other objects.

The Etcbc object stores the complete information found in the Etcbc config file in a bunch of data structures, and defines accessor functions for it.

The feature information is stored in the following dictionaries:

(Ia) part_info[part][subpart][object_type] = set of feature_names. N.B. object_types may occur in multiple parts.

(Ib) part_object[part] = set of object_types

(Ic) part_feature[part][object_type] = set of feature_names

(Id) object_subpart[part][object_type] = subpart

Stores the subpart in which each object type occurs, per part

object_info[object_type] = [attributes]

Stores the information on objects, except their features and values.

feature_info[object_type][feature_name] = [attributes]

Stores the information on features, except their values.

value_info[object_type][feature_name][feature_value] = [attributes]

Stores the feature value information

reference_feature[feature_name] = True | False

Stores the names of features that reference other objects. The feature ‘self’ is an example, but we skip it: ‘self’ gets the value False, while other features, such as mother and parents, get True.

annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)

Stores information about the files that are generated as the resulting LAF resource.

The files are organized by part and subpart. Header files and primary data files are in part ‘’. Other files may or may not contain annotations. If not, they only contain regions. Then is_region is True.

ftype

the file identifier to be used in header files

medium

text or xml

location

the last part of the file name. All file names can be obtained by appending location after the absolute path followed by a common prefix.

requires

the identifier of a file that is required by the current file

annotations

the annotation labels to be declared for this file
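The four part-related dictionaries (Ia) to (Id) described above could be filled along these lines. This is only a sketch: the input rows, the subpart names and the feature names are hypothetical, and the real class may build these structures differently.

```python
from collections import defaultdict

# Hypothetical input rows: (part, subpart, object_type, feature_name).
rows = [
    ("monad", "minimal", "word", "text"),
    ("monad", "lex", "word", "lexeme"),
    ("lingo", "sentence", "sentence", "number"),
    ("lingo", "sentence", "sentence_atom", "number"),
    ("lingo", "phrase", "phrase", "function"),
]

part_info = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))  # (Ia)
part_object = defaultdict(set)                                          # (Ib)
part_feature = defaultdict(lambda: defaultdict(set))                    # (Ic)
object_subpart = defaultdict(dict)                                      # (Id)

for part, subpart, object_type, feature_name in rows:
    part_info[part][subpart][object_type].add(feature_name)
    part_object[part].add(object_type)
    part_feature[part][object_type].add(feature_name)
    # Only unambiguous when the subparts of a part divide its object types
    # (as in 'lingo'); for 'monad' the last subpart seen simply wins here.
    object_subpart[part][object_type] = subpart

print(sorted(part_feature["monad"]["word"]))
print(object_subpart["lingo"]["phrase"])
```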

The feature information file contains lines with tab-delimited fields (only the starred ones are used):

column  field          index among used fields
0*      object_type    0
1*      feature_name   1
2*      defined_on     2
3*      etcbc_type     3
4*      feature_value  4
5*      isocat_key     5
6       isocat_id
7*      isocat_name    6
8       isocat_type
9       isocat_def
10      note
11*     part           7
12*     subpart        8

Initialization consists of reading the Excel sheet with feature information.

The sheet should be in the form of a tab-delimited text file.

There are columns with:

ETCBC information: object_type, feature_name, also_defined_on, type, value
ISOcat information: key, id, name, type, definition, note
LAF sectioning: part, subpart

See the list of columns above.

So the file gives essential information to map objects/features/values to ISOcat data categories. It indicates how the LAF output can be chunked in parts and subparts.
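Selecting the starred columns from one line of that file might look like the sketch below. The function name parse_feature_line and all sample values are hypothetical; only the column positions and field names come from the description above.

```python
# Positions of the starred (used) columns and their field names.
USED_COLUMNS = (0, 1, 2, 3, 4, 5, 7, 11, 12)
USED_NAMES = (
    "object_type", "feature_name", "defined_on", "etcbc_type",
    "feature_value", "isocat_key", "isocat_name", "part", "subpart",
)


def parse_feature_line(line):
    """Return the used fields of one tab-delimited line as a dictionary."""
    fields = line.rstrip("\n").split("\t")
    # Tolerate lines whose trailing empty fields were cut off.
    fields += [""] * (13 - len(fields))
    return {name: fields[col] for name, col in zip(USED_NAMES, USED_COLUMNS)}


# Hypothetical sample line with all 13 fields.
sample = "\t".join([
    "word", "sp", "", "enum", "verb", "partOfSpeech", "DC-0000",
    "part of speech", "", "", "", "monad", "morph",
]) + "\n"

record = parse_feature_line(sample)
print(record["object_type"], record["part"], record["subpart"])
```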

check_raw_files(part)[source]¶

feature_atts(object_type, feature_name)[source]¶

feature_info = {}¶

feature_list(object_type)[source]¶

feature_list_subpart(part, subpart, object_type)[source]¶

is_ref_skip(feature_name)[source]¶

list_ref_noskip()[source]¶

make_mql(name, query)[source]¶

make_query_file(part)[source]¶

mql(query)[source]¶

object_atts(object_type)[source]¶

object_info = {}¶

object_list(part, subpart)[source]¶

object_list_part(part)[source]¶

object_subpart = defaultdict(<function Etcbc.<lambda>>, {})¶

part_feature = defaultdict(<function Etcbc.<lambda>>, {})¶

part_info = {}¶

part_list()[source]¶

part_object = defaultdict(<function Etcbc.<lambda>>, {})¶

raw_file(part)[source]¶

reference_feature = {}¶

run_mql(query_file, result_file)[source]¶

settings = None¶

subpart_list(part)[source]¶

the_subpart(part, object_type)[source]¶

value_atts(object_type, feature_name, feature_value)[source]¶

value_info = {}¶

value_list(object_type, feature_name)[source]¶

emdros2laf.laf module¶

class Laf(settings, et, val)[source]¶

Bases: object

Knows the LAF data format.

All LAF knowledge is stored in template files together with sections in the main configuration file. The LAF class finds those templates, sets up the result files, and fills them.

Note:

Templates

template[key] = text: where key is an entry in the laf_templates section of the main config file.

Note:

Files and Filetypes

annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)

The order is important, so we generate a list too:

file_order: list of ftypes according to the file_types section in the main config file, expanded, in the order encountered

where

ftype

comes from the file_types section in the main config file. It has the shape of a LAF file identifier, but with wildcards.

f.xxxxxx: not an annotation file, but primary data or a header file
f_part.subpart: annotation file for part, subpart

for each ftype

there is an infostring consisting of fields

location: file name of corresponding file, modulo a common prefix
medium: file type (text or xml)
annotations: space separated annotation labels occurring in this part, subpart
requires: space separated list of ftypes of required files

is_region

reveals whether the file only contains regions or not. A pure region file needs a different template.
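As an illustration of the annotation_files structure, the sketch below uses a named tuple for the six fields and derives a file name from location and a common prefix. The names FileInfo, file_path and required_ftypes, as well as every sample value, are hypothetical; the source only states that the entries are tuples with these fields.

```python
import os
from collections import defaultdict, namedtuple

FileInfo = namedtuple(
    "FileInfo", "ftype medium location requires annotations is_region"
)

annotation_files = defaultdict(dict)
# A pure region file: no annotation labels, is_region is True.
annotation_files["monad"]["regions"] = FileInfo(
    "f_monad.regions", "xml", "monad.regions.xml", "f.primary", "", True
)
# An annotation file that requires the region file.
annotation_files["monad"]["words"] = FileInfo(
    "f_monad.words", "xml", "monad.words.xml", "f_monad.regions", "ft db", False
)


def file_path(base_dir, prefix, info):
    """Absolute path, then the common prefix, then the location."""
    return os.path.join(base_dir, prefix + info.location)


def required_ftypes(info):
    """The requires field is a space-separated list of ftypes."""
    return info.requires.split()


words = annotation_files["monad"]["words"]
path = file_path("/tmp/laf", "etcbc_", words)
print(path, required_ftypes(words), words.annotations.split())
```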

Note:

Header Generation

All header files are generated here: the feature declaration file, the header for the resource as a whole, and the header for the primary data file.

The headers of the annotation files are included in those files. Those headers contain statistics: counts of the number of annotations with a given label. We know those numbers only after generation, because these statistics are collected during further processing.

When the annotation files are generated, we use placeholders for the statistics. In a post-generation stage we read/write the annotation files and replace the placeholders by the true numbers. The files are written in situ, so we must take care that the placeholders contain enough space around them.
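The placeholder technique can be sketched as follows. The placeholder syntax ({{label}} padded with spaces), the reserved width and the sample count are assumptions made for illustration; the actual templates may look different. The point is that each replacement has exactly the length of the field it replaces, so the file can be rewritten in place without shifting any other content.

```python
import os
import re
import tempfile

WIDTH = 12  # assumed width reserved for each count
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\} *")


def placeholder(label):
    """A placeholder such as '{{word}}', padded with spaces to WIDTH."""
    return ("{{" + label + "}}").ljust(WIDTH)


def fill_counts(path, stats):
    """Replace every placeholder in the file by its count, in place."""
    def fill(match):
        field = match.group(0)
        number = str(stats[match.group(1)])
        if len(number) > len(field):
            raise ValueError("not enough space reserved for " + number)
        return number.ljust(len(field))  # same length as the field

    with open(path, "r+", encoding="utf-8") as handle:
        text = handle.read()
        handle.seek(0)
        handle.write(PLACEHOLDER.sub(fill, text))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('<usage label="word" occurs="' + placeholder("word") + '"/>\n')
    size_before = os.path.getsize(demo_path)
    fill_counts(demo_path, {"word": 12345})  # hypothetical count
    size_after = os.path.getsize(demo_path)
    with open(demo_path, encoding="utf-8") as handle:
        result = handle.read()

print(repr(result))
```

In this sketch the padding ends up inside the attribute value; a real template would place it where whitespace is insignificant.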

Note:

Processing

This class provides methods to initialize and finalize the generation of primary data files and annotation files. There are methods to open/close all files that are relevant to the part that is being processed. (Part being: ‘monad’, ‘section’, ‘lingo’).

Note:

Statistics

Counts are collected in a stats dictionary.

stats[statistic_name] = statistic_value

annotation_files = defaultdict(<function Laf.<lambda>>, {})¶

et = None¶

file_handles = {}¶

file_order = []¶

finish_annot(part)[source]¶

finish_primary()[source]¶

gstats = defaultdict(<function Laf.<lambda>>, {})¶

makefeatureheader()[source]¶

makeheaders()[source]¶

makeprimaryheader()[source]¶

makeresourceheader()[source]¶

primary_handle = None¶

report()[source]¶

settings = None¶

start_annot(part)[source]¶

start_primary()[source]¶

stats = defaultdict(<function Laf.<lambda>>, {})¶

template = {}¶

emdros2laf.transform module¶

class Transform(settings, et, lf)[source]¶

Bases: object

Transforms ETCBC data into a LAF resource

ETCBC knowledge comes from the Etcbc class; LAF knowledge comes from the Laf class.

Reads data from the raw MQL export and builds the annotation files. For part monad there are extra things: the primary data file is built, and one of the annotation files contains only regions and no annotations.

et = None¶

lf = None¶

process_lines(part)[source]¶

Data transformation for part. Input: the lines of a raw emdros output file, which is processed line by line. Every line contains an object type, object identifier, monad indicator and list of features. This has to be translated to primary data and annotations.

Efficiency is very important. It will not do to call functions or follow long chains of dereferencing. Yet a lot has to happen. That is why this is a lengthy loop, and we maintain quite a lot of information from elsewhere in the program in loop-global variables. Not doing so might increase the running time 10-fold. Currently the complete program runs within 15 minutes (including generating raw data and validating) on a MacBook Air mid 2012.
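The idea of keeping frequently used information in local variables outside the loop body can be illustrated with the sketch below. It is not the actual loop: the line format (tab-separated object type, identifier, monads, then name=value features), the function name and the sample data are all hypothetical, and the sketch still calls a few string methods per line.

```python
def process_lines_sketch(lines, wanted_features):
    """Collect (object_type, id, monads, name, value) tuples."""
    annotations = []
    # Look these up once, instead of dereferencing on every iteration.
    add = annotations.append
    wanted = frozenset(wanted_features)

    for line in lines:
        object_type, object_id, monads, *features = line.rstrip("\n").split("\t")
        for feature in features:
            name, _, value = feature.partition("=")
            if name in wanted:
                add((object_type, object_id, monads, name, value))
    return annotations


# Hypothetical raw lines.
raw_lines = [
    "word\t1\t1-1\tsp=verb\tgn=m\n",
    "phrase\t2\t1-3\tfunction=Pred\n",
]
result = process_lines_sketch(raw_lines, {"sp", "function"})
print(result)
```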

settings = None¶

transform(part)[source]¶

interval(iv)[source]¶

makeuni(match)[source]¶: Make proper Unicode out of text that contains byte escape codes such as \xb6
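One way such a conversion could work is sketched below, assuming the escaped bytes are UTF-8 (the source does not state the encoding). The pattern ESCAPES and the function name makeuni_sketch are hypothetical; the function takes a regular expression match, as the signature above suggests.

```python
import re

# A run of one or more literal escapes of the form \xNN.
ESCAPES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def makeuni_sketch(match):
    """Turn a run of \\xNN escapes into the characters they encode."""
    hex_digits = match.group(0).replace("\\x", "")
    # Assumption: the escaped bytes form valid UTF-8.
    return bytes.fromhex(hex_digits).decode("utf-8")


# The raw string contains literal backslashes, not real bytes.
escaped = r"\xd7\x91\xd6\xb0 abc"
decoded = ESCAPES.sub(makeuni_sketch, escaped)
print(decoded)
```

Here the four escaped bytes decode to two code points, U+05D1 and U+05B0, and the unescaped remainder of the text is left alone.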

primary_data(text, trailer)[source]¶: Distil primary data from two features on the word objects. Apply necessary tweaks!

emdros2laf.validate module¶

class Validate(settings)[source]¶

Bases: object

Validates all generated files, knows the schemas involved.

The main program generates a bunch of XML files, according to various schemas. They can be sent to this object, with or without a schema specification. All files with a schema specification will be validated.

The base locations of the schemas and of the generated files will be retrieved from the main configuration. All schemas will be copied from source to destination.

generated_files = list of [absolute_path, schema in destination, validation result]

Initialization consists of getting the schema locations from the config and copying all schemas over.

add(xml, xsd)[source]¶

Add an item to the generated files list. If xsd is given, the file will eventually be validated.

The validation result will be stored in a member of the item, which is initially None. If validation takes place, None will be replaced by True or False, depending on whether the xml is valid with respect to the xsd.

generated_files = []¶

report()[source]¶: Print a list of all generated files and indicate validation outcomes

settings = None¶

validate()[source]¶: Validate all eligible files, but only if the validation flag is on
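The bookkeeping described for add(), validate() and report() might be sketched as follows. The class ValidateSketch, its constructor arguments and the checker callable are hypothetical; in particular, the real class performs schema validation, which this sketch replaces by a stand-in function so that it stays self-contained.

```python
class ValidateSketch:
    """Hypothetical stand-in for the bookkeeping of the Validate class."""

    def __init__(self, validation_flag, checker):
        self.validation_flag = validation_flag
        # Assumed callable: checker(xml_path, xsd_path) -> bool
        self.checker = checker
        self.generated_files = []

    def add(self, xml, xsd=None):
        # Each item is [path, schema, validation result]; result starts as None.
        self.generated_files.append([xml, xsd, None])

    def validate(self):
        if not self.validation_flag:
            return
        for item in self.generated_files:
            xml, xsd, _ = item
            if xsd is not None:  # only files with a schema are eligible
                item[2] = bool(self.checker(xml, xsd))

    def report(self):
        labels = {None: "not validated", True: "valid", False: "INVALID"}
        return ["{}: {}".format(xml, labels[result])
                for xml, _, result in self.generated_files]


# Stand-in checker: accepts any path ending in .xml (not real validation).
val = ValidateSketch(True, lambda xml, xsd: xml.endswith(".xml"))
val.add("/out/a.xml", "/out/schema.xsd")
val.add("/out/b.txt")
val.add("/out/c.dat", "/out/schema.xsd")
val.validate()
report_lines = val.report()
print("\n".join(report_lines))
```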

emdros2laf.run module¶

dotask(part)[source]¶

final()[source]¶

init()[source]¶

processor()[source]¶

emdros2laf.mylib module¶

class Timestamp[source]¶

Bases: object

elapsed()[source]¶

progress(msg)[source]¶

timestamp = None¶

camel(text)[source]¶

fillup(size, val, lst)[source]¶

pretty(data)[source]¶

run(cmd, dyld=False)[source]¶

runx(cmd, dyld=False)[source]¶

today()[source]¶