Migration Data Work HOWTO / Toolkit
========================================================================
The following is for migrating into an existing system like PINES:

Get the incoming bib data and translate it to UTF-8 MARCXML. It may
contain holdings. It may also contain XML or MARC errors that you have
to sanitize before your tools will work. This is one way to translate
MARC-8 MARC21 to UTF-8 MARCXML:

  yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml \
    incoming.marc > incoming.marc.xml
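
As a quick sanity check after the conversion, it can help to count the
records in the MARCXML output and compare that number with what you
expect from the source system. The following is a minimal sketch and
not part of the toolkit. The sample file and its contents are made up
for illustration; in practice you would point the count at
incoming.marc.xml.

```shell
# Build a tiny stand-in for incoming.marc.xml (hypothetical sample data).
workdir=$(mktemp -d)
cat > "$workdir/incoming.marc.xml" <<'EOF'
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record><leader>00000nam a2200000 a 4500</leader></record>
<record><leader>00000nam a2200000 a 4500</leader></record>
</collection>
EOF

# Count opening <record> tags. grep -o prints one match per line, so the
# count is still right when several records share a line.
record_count=$(grep -o '<record[ >]' "$workdir/incoming.marc.xml" | wc -l | tr -d ' ')
echo "records: $record_count"
```

A count that differs from the source system's total suggests that
records were dropped or mangled and the input needs sanitizing first.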

If you need to trim the bibs to a subset based on the presence of a
certain value in a specific tag/subfield, filter them. For example, if
you have the bibs for all libraries in a foreign system and only need
the bibs belonging to a specific migrating library, you might filter
on their holding tags:

  trim_marc_based_on_tag_subfield_value.pl 999 m BRANCH_CODE \
    incoming.marc.xml > incoming.filtered.marc.xml

Embed potential native record ids into the incoming records:

  renumber_marc -rf 100000 -t 903 -s a -o incoming.renumbered.marc.xml \
    incoming.marc.xml

Get primary fingerprints for the incoming data, then get a bib dump of
the matching records from the incumbent system:

  fingerprinter -r primary -t 903 -s a -o incoming.primary.fp \
    -x incoming.primary.ex incoming.renumbered.marc.xml

#Edit the query_for_primary_matching_incumbent_record.pl script to
#point to the correct Evergreen database and the table holding the
#incumbent primary fingerprints (FIXME add in how to create such a
#table).
#
#  query_for_primary_matching_incumbent_record.pl incoming.primary.fp \
#    | sort | uniq > primary_matching_incumbent.record_ids
#
#In a postgres shell, create a scratch table to hold these ids:
#
#  CREATE TABLE primary_matching_incumbent_records_for_incoming_library
#         (id BIGINT);
#  COPY primary_matching_incumbent_records_for_incoming_library
#       FROM 'primary_matching_incumbent.record_ids';
#
#To dump the matching incumbent records to a file, in a postgres shell
#do:
#
#  \o matching_incumbent_records.dump
#  SELECT b.id, b.tcn_source, b.tcn_value,
#    regexp_replace(b.marc,E'\n','','g')
#    FROM biblio.record_entry AS b
#    JOIN primary_matching_incumbent_records_for_incoming_library
#    AS c USING ( id );
#
#Now to turn that dump into a MARCXML file with record numbers and TCN
#embedded in tag 901, do:
#
#  marc_add_ids -f id -f tcn_source -f tcn_value -f marc \
#    < matching_incumbent_records.dump > matching_incumbent_records.marc.xml
#
#This file may itself need some sanitizing. The following transforms
#code=""" into code="&#x0022;", for example:
#
#  sed 's/code="""/code="\&#x0022;"/g' \
#    matching_incumbent_records.marc.xml \
#    > matching_incumbent_records.escaped.marc.xml
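
The substitution in the sanitizing step above can be tried on a single
line before running it over a whole file. This is a minimal sketch; the
sample line is hypothetical and only stands in for a malformed subfield
element.

```shell
# Hypothetical one-line sample with an unescaped quote as the subfield code.
sample='<subfield code=""">text</subfield>'

# Replace the bare quote with a character reference. In the replacement,
# \& yields a literal ampersand instead of the matched text.
fixed=$(printf '%s\n' "$sample" | sed 's/code="""/code="\&#x0022;"/g')
printf '%s\n' "$fixed"
# prints: <subfield code="&#x0022;">text</subfield>
```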

Get full fingerprints for both datasets and match them:

  fingerprinter -r full -t 901 -s c -o incumbent.fp -x incumbent.ex \
    matching_incumbent_records.marc.xml
  fingerprinter -r full -t 903 -s a -o incoming.fp -x incoming.ex \
    incoming.renumbered.marc.xml

The script below produces matched groupings. It can optionally take a
fourth and fifth parameter providing scoring information for
determining lead records. In the past, it considered certain metrics
for MARC quality. In its latest incarnation, it assumes an incumbent
record will be the lead record, and looks at the number of holdings
and a possible match on tag 245 subfield b to determine which of the
incumbent records becomes the lead record. The example invocation
below does not use scoring.

  match_fingerprints.pl "name of dataset for dedup interface" \
    incumbent.fp incoming.fp

This produces two files, match.groupings and match.record_ids. The
format of match.groupings is suitable for insertion into the database
for the dedup interface.
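
Before loading match.groupings anywhere, it may be worth checking that
every line has the expected number of fields. The sketch below assumes
five '^'-separated fields per line, based on the five columns named in
the LOAD DATA statement later in this document. The sample lines are
made up, and the second one is deliberately short.

```shell
# Hypothetical sample: line 1 has five fields, line 2 is truncated.
workdir=$(mktemp -d)
cat > "$workdir/match.groupings" <<'EOF'
0^demo dataset^1001^1001,100000^1001,100000
0^demo dataset^1002
EOF

# A well-formed line contains exactly four carets. Print every line that
# does not, prefixed with its line number.
bad_lines=$(grep -n -v -E '^([^^]*\^){4}[^^]*$' "$workdir/match.groupings")
printf '%s\n' "$bad_lines"
# prints: 2:0^demo dataset^1002
```

An empty result means every line has five fields.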

Import these matches and records into the legacy dedup interface for
viewing. First, extract the ids of the specific MARC records involved,
so that the records can be tarred up for the dedup interface:

  cut -d'^' -f3 match.groupings > incumbent.record_ids
  cut -d'^' -f5 match.groupings | cut -d, -f2- | tr ',' '\n' \
    > incoming.record_ids
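
To see what the two extraction pipelines above produce, here is a
minimal sketch run against made-up data. The sample lines are
hypothetical. The sketch assumes that field three holds the incumbent
lead record id and that field five lists that same id first, followed
by the incoming ids, which would explain why the first entry of field
five is dropped.

```shell
# Hypothetical match.groupings with two groups.
workdir=$(mktemp -d)
cat > "$workdir/match.groupings" <<'EOF'
0^demo dataset^1001^1001,100000,100001^1001,100000,100001
0^demo dataset^1002^1002,100005^1002,100005
EOF

# Field 3: one incumbent record id per group.
cut -d'^' -f3 "$workdir/match.groupings" > "$workdir/incumbent.record_ids"

# Field 5: drop the first id, then put each remaining id on its own line.
cut -d'^' -f5 "$workdir/match.groupings" | cut -d, -f2- | tr ',' '\n' \
  > "$workdir/incoming.record_ids"

cat "$workdir/incumbent.record_ids"   # prints 1001 and 1002, one per line
cat "$workdir/incoming.record_ids"    # prints 100000, 100001 and 100005
```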

#  mkdir dataset ; cd dataset
#  select_marc.pl ../incumbent.record_ids 901 c \
#    ../matching_incumbent_records.marc.xml
#  select_marc.pl ../incoming.record_ids 903 a \
#    ../incoming.renumbered.marc.xml
#  cd ..
#  tar cvf dataset.tar dataset
#
#In a mysql shell for the database used with the dedup interface:
#
#  LOAD DATA LOCAL INFILE 'match.groupings' INTO TABLE record_group
#    FIELDS TERMINATED BY '^'
#    ( status, dataset, best_record, records, original_records );
#
#Create a pretty-printed text dump of the non-matching incoming records:
#
#  dump_inverse_select_marc.pl incoming.record_ids 903 a \
#    incoming.renumbered.marc.xml > non_matching_incoming.mrc.txt 2> \
#    non_matching_incoming.mrc.txt.err

Finally, process the non-matching incoming records for loading (the
file names below are examples):

  marc2bre.pl --idfield=903 --dontuse=live_tcns.txt \
    -f quitman_non_matching_incoming.mrc.xml \
    -f catoosa_non_matching_incoming.mrc.xml --marctype=XML > some.bre

  direct_ingest.pl < some.bre > some.ingest

  perl pg_loader.pl -or bre -or mrd -or mfr -or mtfe -or mafe -or msfe \
    -or mkfe -or msefe -a mrd -a mfr -a mtfe -a mafe -a msfe -a mkfe \
    -a msefe < ~/gutenberg.ingest > ~/gutenberg.sql