=head1 The Genealogy::Gedcom Namespace

This document is about the new set of modules I (Ron Savage) am writing (2011-08-12):

=over 4

=item A) L<Genealogy::Gedcom>

=item B) L<Genealogy::Gedcom::Reader>

=item C) L<Genealogy::Gedcom::Reader::Lexer>

=item D) L<Genealogy::Gedcom::Reader::Lexer::DFA>

=item E) L<Genealogy::Gedcom::Reader::Parser>

=item F) L<Genealogy::Gedcom::Writer>

=back

The basic code, and hence the directory structure, of this set is copied directly from L<Graph::Easy::Marpa>, so yes, the parser will use Marpa.

=head1 FAQ

=head2 Which version of GEDCOM is the default for these modules?

DRAFT Release 5.5.1, in Ged551-5.pdf. See L</References> for downloading details.

=head2 Where does all this leave Paul Johnson's Gedcom.pm?

People using that module should continue to use it indefinitely.

I do not see L<Genealogy::Gedcom> as being a drop-in replacement for Gedcom.pm.

Nevertheless, if and when G::G reaches V 1.00 (what I call production quality), it might be worth considering for new code since, dealing with identical input,
the two modules will obviously have a number of methods in common.

=head2 What will the new modules do?

=over 4

=item A) L<Genealogy::Gedcom> is a dummy module, which will one day have methods to directly manipulate data at the individual and family level.

The following modules all operate at a lower level.

=item B) L<Genealogy::Gedcom::Reader> is a wrapper which calls both the lexer and the parser.

=item C) L<Genealogy::Gedcom::Reader::Lexer> is called by L<Genealogy::Gedcom::Reader>, and can be called from any module.

It reads the GEDCOM file, and lexes it, meaning it identifies tokens in the input stream, but does not assign meaning to those tokens. The parser assigns meaning to them.

Experts in the field are free to disagree with my casual definitions of lexing and parsing.

I call the GEDCOM file 'raw' data, so the CSV file of tokens which the lexer can output (see below) is called 'cooked'. This terminology helps name some of the many command line options available.

Outputs supported by the lexer:

=over 4

=item o A RAM-based array (of tokens), to be passed to the parser, or to any other module.

Note: In this array, of type L<Set::Array>, each element is a hashref with these key => value pairs:

	count      => $myself -> _count,
	data       => defined($field[2]) ? $field[2] : '', # To allow for $field[2] eq '0'.
	level      => $field[0],
	line_count => $myself -> line_count,
	tag        => $field[1],
	type       => $type,

where @field = split(/\s+/, $input_record, 3).

The code reading the file removes leading and trailing spaces, and this usage of split handles indented GEDCOM files.

The removal of trailing spaces may cause inconvenience when the file deliberately contains a trailing space. See page 10 of the GEDCOM document, where it discusses the CONC and CONT tags.
For that reason, the removal of trailing spaces may be dropped from the code, or made optional.

	count is the token count, 1 .. N. Hence it's just the array index + 1.
	data is '20 Dec 1775', in the GEDCOM record '2     DATE 20 Dec 1775'.
	level is '2', in that record.
	line_count is that record's line number within the input file. This helps identification of errors in the file.
	tag is 'DATE', within that record.
	type is the context of that record. If the parent of that record is 'BIRT', and BIRT's parent is '0 @I1@ INDI', then type is 'individual'.

In fact, the record '0 @I1@ INDI', and all its child records, have type 'individual'.
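
As a rough illustration of the keys described above, here is a minimal sketch of turning input records into token hashrefs. It is not the lexer's actual code: the function lex_lines() is hypothetical, and the rule it uses to set 'type' (a level 0 INDI record marks itself and its children as 'individual') is a simplifying assumption. Note that this naive split puts the cross-reference '@I1@' into 'tag' for a level 0 record.

	use strict;
	use warnings;

	# Hypothetical helper, not the real lexer: build token hashrefs.
	sub lex_lines
	{
		my(@line)    = @_;
		my($count)   = 0;
		my($line_no) = 0;
		my($type)    = '';
		my(@token);

		for my $record (@line)
		{
			$line_no++;                  # Blank lines are counted here...

			$record =~ s/^\s+|\s+$//g;   # Trim leading and trailing spaces.

			next if ($record eq '');     # ...but they produce no token.

			my(@field) = split(/\s+/, $record, 3);

			# Assumption: a level 0 record sets the type of its children.
			if ($field[0] eq '0')
			{
				$type = (defined($field[2]) && $field[2] eq 'INDI') ? 'individual' : '';
			}

			push @token,
			{
				count      => ++$count,
				data       => defined($field[2]) ? $field[2] : '',
				level      => $field[0],
				line_count => $line_no,
				tag        => $field[1],
				type       => $type,
			};
		}

		return \@token;
	}

	my($token) = lex_lines('0 @I1@ INDI', '', '  1 BIRT', '    2     DATE 20 Dec 1775');

	for my $t (@$token)
	{
		print join(' | ', map{$t->{$_} } qw/count line_count level tag data type/), "\n";
	}

Because of the blank second line, the BIRT token gets count 2 but line_count 3.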

I have to use $myself, since the DFA calls functions, and $self is not available within these functions.

In fact, L<Set::FA::Element> calls these functions with only 1 parameter, the object of type L<Set::FA::Element> itself.

So, $myself is a global variable within the lexer, and is a copy of $self. In this way I can work around the fact that the DFA only calls functions, never methods.
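
The $myself technique can be sketched as follows. Everything here is illustrative: the package My::Lexer, the function save_token() and the 'match' key are hypothetical names, and the loop in run() is only a toy stand-in for L<Set::FA::Element>, which I assume here does nothing more than call the function with one parameter.

	use strict;
	use warnings;

	package My::Lexer;   # Hypothetical stand-in for the lexer class.

	my($myself);         # File-scoped copy of $self, visible to callbacks.

	sub new
	{
		my($class) = @_;

		return bless {items => []}, $class;
	}

	# A callback in the style the DFA expects: a plain function whose
	# only parameter is the DFA object, so there is no $self here.
	sub save_token
	{
		my($dfa) = @_;

		push @{$myself->{items} }, $dfa->{match};
	}

	sub run
	{
		my($self, @input) = @_;
		$myself           = $self;

		# Toy stand-in for the DFA: call the function with 1 parameter.
		for my $match (@input)
		{
			save_token({match => $match});
		}

		return $self->{items};
	}

	package main;

	my($items) = My::Lexer -> new -> run('INDI', 'BIRT', 'DATE');

	print join(', ', @$items), "\n";

The price of this approach is that only one lexer object can safely be active at a time, since they would all share the one variable.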

Lastly, another note on the 'count' key. It is not the line number in the input stream, because blank lines produce no token: the code ignores them, but still counts them in 'line_count'.

=item o A CSV file (of tokens).

This file too can be passed to the parser, or to any other module.

By default, the CSV file is not produced.

=item o A pretty-printed report (of tokens), written to the log.

A logger option can suppress this report.

The default logger is L<Log::Handler>, whose default output goes to the screen.

=back

To do its work, L<Genealogy::Gedcom::Reader::Lexer> calls the next module.

=item D) L<Genealogy::Gedcom::Reader::Lexer::DFA>, where DFA stands for Deterministic Finite Automaton (also known as a State Machine). This module calls L<Set::FA::Element>,
a module which I did not write but which I now maintain.

L<Genealogy::Gedcom::Reader::Lexer> will perform some validation on the input tokens.

=item E) L<Genealogy::Gedcom::Reader::Parser> is also called by L<Genealogy::Gedcom::Reader>, and can be called from any module.

It too will perform some validation on the input tokens, which can come from RAM or a 'cooked' file.

=item F) L<Genealogy::Gedcom::Writer> will simply output the array of tokens, enabling round-tripping of the input stream.

=back

Later, there will probably be other modules in the series.
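
To make the round-tripping mentioned for L<Genealogy::Gedcom::Writer> more concrete, here is a minimal sketch which rebuilds one GEDCOM line per token. The function write_tokens() is hypothetical, and it assumes tokens with the 'level', 'tag' and 'data' keys produced by the lexer.

	use strict;
	use warnings;

	# Hypothetical writer: rebuild one GEDCOM line per token.
	sub write_tokens
	{
		my($token) = @_;
		my(@line);

		for my $t (@$token)
		{
			my($line) = "$t->{level} $t->{tag}";

			# Test the length, not the truth, so data of '0' survives.
			$line .= " $t->{data}" if (length $t->{data});

			push @line, $line;
		}

		return @line;
	}

	my(@token) =
	(
		{level => 0, tag => '@I1@', data => 'INDI'},
		{level => 1, tag => 'BIRT', data => ''},
		{level => 2, tag => 'DATE', data => '20 Dec 1775'},
	);

	print "$_\n" for write_tokens(\@token);

A sketch like this reproduces the records but not the original indentation or spacing, since the lexer has already discarded that.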

=head2 Yes, all well and good, but what's the point of re-writing Gedcom.pm?

Ahhh - I thought you'd never ask.

The point is that this design minimizes the effort of future support and maintenance.

The DFA takes a State Transition Table as a parameter to new(), and this STT can come from various sources:

=over 4

=item o A copy of the default STT is stored within the source code of L<Genealogy::Gedcom::Reader::Lexer>, after the __DATA__ token.

This makes it very fast to access (using L<Data::Section::Simple>), and hence this is the default source.

=item o The STT can be read in from any CSV file.

=item o The STT can be read in from any LibreOffice (nee Open Office) *.ods file.

=back

Obviously all forms of the STT have to be in the expected format, which is validated before being used.
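
The kind of format check mentioned above might look like the following. This is only a sketch under loud assumptions: the column names (state, event, next), the function validate_stt() and the sample rows are all hypothetical, the real STT has its own headings, and a proper CSV reader would be needed for quoted fields.

	use strict;
	use warnings;

	# Hypothetical column names; the real STT's headings may differ.
	my(@required) = qw/state event next/;

	# Parse a simple CSV STT (no quoted or embedded commas) and check it.
	sub validate_stt
	{
		my($csv)           = @_;
		my($heading, @row) = grep{length} split(/\n/, $csv);

		die "Empty STT\n" if (! defined $heading);

		my(@column) = split(/,/, $heading);
		my(%seen)   = map{($_ => 1)} @column;

		for my $name (@required)
		{
			die "Missing column '$name'\n" if (! $seen{$name});
		}

		my(@stt);

		for my $row (@row)
		{
			my(%cell);

			@cell{@column} = split(/,/, $row, -1);

			push @stt, \%cell;
		}

		# Every 'next' state must itself be defined as a 'state'.
		my(%state) = map{($_->{state} => 1)} @stt;

		for my $item (@stt)
		{
			die "Unknown next state '$item->{next}'\n" if (! $state{$item->{next} });
		}

		return \@stt;
	}

	my($stt) = validate_stt(join("\n", 'state,event,next', 'start,^0,level_0', 'level_0,\d+,start') );

	print scalar(@$stt), " rows accepted\n";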

You can see a copy of this STT L<here|http://savage.net.au/Perl-modules/html/genealogy/default.stt.html>. The last 2 columns have not been used yet.

The distro ships with scripts/stt2html.pl, which will convert the CSV STT into HTML for easy viewing.

In fact, I work on the STT in LibreOffice, and export it to a CSV file for testing (I can, of course, save it as *.ods too). The CSV file can then be incorporated into the lexer's source code.

This means anyone can easily experiment with patches to the STT, supporting any extension to GEDCOM they dream up, simply by editing text files.

I should say one reason I love this approach is that after examining the source code for Gedcom.pm, I couldn't really understand how it performs its magic, and didn't want to spend too much
time studying that technique. I assure you that I in no way mean to disparage Paul's work: Gedcom.pm is a very cleverly written module, which works very well indeed.

The advantage of my code is the text-based STT combined with the output of an array of lexed and parsed tokens, which anyone can siphon off for their own dark purposes.

Clearly this also means files exported from other genealogy programs can be imported into G::G by judicious editing of the STT.

=head2 Why do states in that STT appear to be doubled?

Because the design of L<Set::FA::Element> demands it.

When text in the input stream is consumed (by matching a regexp in the STT), what happens when the 'current' state and the 'next' state are the same?

L<Set::FA::Element> has adopted the convention that such an event is a noop (after the matching input is consumed). That means the DFA does I<not> execute the state's exit and entry
functions, even though that would sometimes be convenient. So, in such a situation the STT is designed to rock back and forth between 2 identical states.
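
The effect of doubling can be seen in a toy state machine. This is not L<Set::FA::Element> itself, just a sketch which follows the same convention: entry and exit functions run only when the state actually changes. The function run() and the state names are hypothetical, and every input is assumed to match.

	use strict;
	use warnings;

	# Toy state machine. $stt maps each state to its next state.
	sub run
	{
		my($stt, $state, @input) = @_;
		my(@log);

		for my $text (@input)
		{
			my($next) = $stt->{$state};

			# Same state means no exit or entry function is called.
			if ($next ne $state)
			{
				push @log, "exit $state", "enter $next";
			}

			$state = $next;
		}

		return @log;
	}

	# One state looping on itself: the functions never fire.
	my(@single) = run({tag => 'tag'}, 'tag', 1 .. 3);

	# Two identical states rocking back and forth: they fire every time.
	my(@double) = run({tag_1 => 'tag_2', tag_2 => 'tag_1'}, 'tag_1', 1 .. 3);

	print scalar(@single), ' calls versus ', scalar(@double), " calls\n";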

=head2 How do these new modules handle non-standard tags?

Currently, the lexer accepts valid tags which have suffixes. Hence both INDI and INDIVIDUAL are accepted. This will change when validation is implemented.
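
One way to picture the difference between the current behaviour and stricter validation is with a pair of regexps. This is an assumption about how such matching could be written, not a copy of the regexps in the STT, and the list of tags is just an illustrative subset.

	use strict;
	use warnings;

	my(@known) = qw/INDI BIRT DATE/;       # Illustrative subset of tags.
	my($alt)   = join('|', @known);

	my($lenient) = qr/^(?:$alt)[A-Z]*$/;   # A valid tag, then any suffix.
	my($strict)  = qr/^(?:$alt)$/;         # Exactly a valid tag.

	for my $tag (qw/INDI INDIVIDUAL XYZ/)
	{
		printf "%-10s lenient: %-3s strict: %s\n", $tag,
			($tag =~ $lenient ? 'yes' : 'no'), ($tag =~ $strict ? 'yes' : 'no');
	}

Here INDIVIDUAL passes the lenient test and fails the strict one, while XYZ fails both.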

=head2 What about extensions to GEDCOM 5.5.1?

Any extension to the GEDCOM format has to be discussed (if only among Perl programmers), and documented in some manner compatible with the original document.

I have no such suggestions, but I definitely encourage those who do to use the Gedcom mailing list to elicit responses to their ideas.

My code's design simplifies adoption of such extensions. Other code may be just as easy to extend.

=head1 References

=over 4

=item o The original Perl L<Gedcom>

=item o GEDCOM

=over 4

=item o L<GEDCOM Specification|http://wiki.webtrees.net/File:Ged551-5.pdf>

=item o L<GEDCOM Validation|http://www.tamurajones.net/GEDCOMValidation.xhtml>

=item o L<GEDCOM Tags|http://www.tamurajones.net/GEDCOMTags.xhtml>

=back

=item o Usage of non-standard tags

=over 4

=item o L<http://www.tamurajones.net/FTWTEXT.xhtml>

This is apparently the worst offender Tamura Jones has seen. Search that page for 'tags'.

=item o L<http://www.tamurajones.net/GenoPro2011.xhtml>

=item o L<http://www.tamurajones.net/GenoPro2007.xhtml>

=item o L<http://www.tamurajones.net/TheFTWTEXTProblem.xhtml>

=back

=item o Other articles on Tamura's site

=over 4

=item o L<http://www.tamurajones.net/FiveFreakyFeaturesYourGenealogySoftwareShouldNotHave.xhtml>

=item o L<http://www.tamurajones.net/TwelveOrdinaryMustHaveGenealogySoftwareFeatures.xhtml>

=back

=item o Other projects

Many of these are discussed on Tamura's site.

=over 4

=item o L<http://bettergedcom.wikispaces.com/>

=item o L<http://www.ngsgenealogy.org/cs/GenTech_Projects>

=item o L<http://gdmxml.fugal.net/>

=item o L<http://www.cosoft.org/genxml/>

=item o L<http://www.sunflower.com/~billk/GEDC/>

=item o L<http://ancestorsnow.blogspot.com/2011/07/vged.html>

=item o L<http://www.tamurajones.net/GEDCOMValidation.xhtml>

=item o L<http://webtrees.net/>

=item o L<http://swoodbridge.com/Genealogy/lifelines/>

=item o L<http://deadendssoftware.blogspot.com/>

=item o L<http://www.legacyfamilytree.com/>

=item o L<https://devnet.familysearch.org/docs/api-overview>

=back

=back

=head1 The Gedcom Mailing List

Contact perl-gedcom-help@perl.org.

=cut
