Data Munging
PROCESSING LARGE QUANTITIES OF DATA

"Data Munging" is a term that has come to be used for processing quantities of data: reformatting, extraction and so on. It's really what Perl's ALL about, and the language has a number of features which make it especially good for the purpose. In this module, we highlight one or two of the more specialist of these features.

ITERATING OVER DATA IN PERL

So you want to do the same thing to every element of an array? In traditional languages which are not so full-featured as Perl, you'll use a loop, with a keyword such as "while" or "for", and a variable that steps up from 0 or 1 to the length of the array. You can do the same sort of thing in Perl as well if you wish:

    #!/usr/bin/perl
    # Using loops to pass through an "array"
    $tab[0] = $tab[1] = 1;
    for ($k = 2; $k<20; $k++) {
        $tab[$k] = $tab[$k-1]+$tab[$k-2];
    }
    for ($k=0; $k<@tab; $k++) {
        printf("%4d\n",$tab[$k]);
    }

Which gives:

    $ ./oldfash
       1
       1
       2
       3
       5
       8
      13
      21
      34
      55
      89
     144
     233
     377
     610
     987
    1597
    2584
    4181
    6765
    $

Although you can use the traditional approach in Perl, there are other approaches too, which are often easier to code and can be more efficient at run time. You should remember that languages like C and Fortran had ARRAYS which were basic containers for a whole lot of variables all of the same type, whereas Perl uses LISTS which are much more flexible structures upon which operations can be performed in their own right. Have a look at these two, both of which produce exactly the same output as the example above:

    #!/usr/bin/perl
    # A better iteration through a list
    $tab[0] = $tab[1] = 1;
    for ($k = 2; $k<20; $k++) {
        $tab[$k] = $tab[$k-1]+$tab[$k-2];
    }
    printf("%4d\n",$_) for (@tab);

    #!/usr/bin/perl
    # Another iteration through a list
    $tab[0] = $tab[1] = 1;
    for ($k = 2; $k<20; $k++) {
        $tab[$k] = $tab[$k-1]+$tab[$k-2];
    }
    map {printf("%4d\n",$_) } @tab;

* for (or foreach) can be used as an iterator to pass through each element of a list, performing a statement or block on each.
* map iterates through each element of a list, performing a statement or block on each and returning the results as a new list.
* grep iterates through each element of a list, performing a test on each, and returns a list of all the elements for which the test returned a true value.

Here's an example program that generates a list (of file names) and then uses map, grep and for to modify, select and iterate through that list and its derivatives:

    #!/usr/bin/perl
    # Read all file names in current directory
    opendir (DH,".") or die "Cannot read directory: $!";
    @indir = readdir(DH);
    closedir DH;

    # Sizes of files starting with "q" ...
    # Get the file names
    @qfiles = grep(/^q/,@indir);
    # Get the sizes of those files (-s gives the size in bytes)
    @fsizes = map { -s $_ } @qfiles;
    print "q file sizes: @fsizes\n";

    # Largest 10 files in the directory
    # Build a list of [name, size] pairs, then sort by size, largest first
    @ftable = map { [$_, -s $_] } @indir;
    @fts = sort {$$b[1] <=> $$a[1]} @ftable;
    @ft10 = @fts[0..9];
    printf ("%8d %s\n",$$_[1],$$_[0]) for (@ft10);

In operation:

    $ filedata
    q file sizes: 376 375 450 506
     4964352 URL.txt
      605455 access_log
      353968 std.list.w.html
      313472 stdrandom
      291408 wac
      206662 words
      175426 317l18contigs6899.txt
      140366 phone.list.w.html
      114224 postcodes.html
      114112 postcodes
    $

To help you understand what's happening, we wrote that example to create a number of temporary lists, using a single map or grep in each Perl statement. The code can be shortened, which also avoids building the named temporary lists:

    #!/usr/bin/perl
    opendir (DH,".") or die "Cannot read directory: $!";

    # Sizes of files starting with "q" ...
    @fsizes = map { -s $_ } grep /^q/,(@indir = readdir DH);
    print "q file sizes: @fsizes\n";

    # Largest 10 files in the directory
    printf ("%8d %s\n",$$_[1],$$_[0])
        for (sort {$$b[1] <=> $$a[1]} map { [$_, -s $_] } @indir)[0..9];

PROCESSING DATA THROUGH REGULAR EXPRESSIONS

You may be used to using Perl's regular expressions to match patterns and extract matches from a line of text, perhaps stepping through all the lines of a file. Have you ever thought of using them on the whole contents of a file at one go?
You can do so, provided you're sure that the file won't be so large that you'll fill your computer's memory / swap space. Here's an example that reads a UK postcode file, and prints out in reverse order the names of all postal towns that come under the main Aberdeen office:

    #!/usr/bin/perl
    # read in and locate appropriate postcodes
    # version 1 - conventional programming techniques
    open (FH,"postcodes") or die "Cannot open postcodes: $!";
    while ($line = <FH>) {
        if ($line =~ /Aberdeen$/) {
            push @aber,$line;
        }
    }
    # Start at the last index ($#aber), not at the element count
    for ($k=$#aber;$k>=0;$k--) {
        print $aber[$k];
    }

    #!/usr/bin/perl
    # read in and locate appropriate postcodes
    # version 2 - selection with grep
    open (FH,"postcodes") or die "Cannot open postcodes: $!";
    @aber = grep(/Aberdeen$/,<FH>);
    print reverse @aber;

    #!/usr/bin/perl
    # read in and locate appropriate postcodes
    # version 3 - using regular expressions
    open (FH,"postcodes") or die "Cannot open postcodes: $!";
    # Read the whole file into a single string
    read (FH,$full, -s "postcodes");
    # m lets $ match at the end of every line; g returns every match
    @aber = ($full =~ /.*Aberdeen$/mg);
    print (join("\n",reverse @aber),"\n");

In all cases, the results look like this:

    $ pc1 (or pc2 or pc3)
    Turriff Aberdeenshire AB3 Aberdeen
    Strathdon Aberdeenshire AB3 Aberdeen
    Stonehaven Aberdeenshire AB3 Aberdeen
    Skene Aberdeenshire AB3 Aberdeen
    Peterhead Aberdeenshire AB4 Aberdeen
    Macduff Banffshire AB4 Aberdeen
    Laurencekirk Kincardineshire AB3 Aberdeen
    Keith Banffshire AB5 Aberdeen
    Inverurie Aberdeenshire AB5 Aberdeen
    Insch Aberdeenshire AB5 Aberdeen
    Huntly Aberdeenshire AB5 Aberdeen
    Fraserburgh Aberdeenshire AB4 Aberdeen
    Ellon Aberdeenshire AB4 Aberdeen
    Craigellachie Banffshire AB3 Aberdeen
    Buckie Banffshire AB5 Aberdeen
    Braemar Aberdeenshire AB3 Aberdeen
    Banff Banffshire AB4 Aberdeen
    Banchory Kincardineshire AB3 Aberdeen
    Ballindalloch Aberdeenshire AB3 Aberdeen
    Ballater Aberdeenshire AB3 Aberdeen
    Alford Aberdeenshire AB3 Aberdeen
    Aboyne Aberdeenshire AB3 Aberdeen
    Aberlour Banffshire AB3 Aberdeen
    ABERDEEN Aberdeenshire AB1,2 Aberdeen
    $
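If you'd like to try the whole-file regular expression approach without a postcodes file to hand, here's a minimal sketch that works on a string held in memory. The sample records and the subroutine names by_lines and by_regex are our own assumptions for illustration (the layout of the real file isn't shown here), and the script simply checks that selecting line by line and matching across the whole string pick out the same records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the "postcodes" file
my $full = join "\n",
    "ABERDEEN Aberdeenshire AB1,2 Aberdeen",
    "Bath Somerset BA1,2 Bath",
    "Banff Banffshire AB4 Aberdeen",
    "Aberdare Glamorgan CF44 Cardiff",
    "";

# Line-at-a-time selection: split into lines, then grep
sub by_lines {
    my ($text) = @_;
    return grep { /Aberdeen$/ } split /\n/, $text;
}

# Whole-string selection: m lets ^ and $ match at every line
# boundary, g returns every match, and .* never crosses a newline
sub by_regex {
    my ($text) = @_;
    return $text =~ /^.*Aberdeen$/mg;
}

my @a = by_lines($full);
my @b = by_regex($full);

print "$_\n" for reverse @b;
print "same result\n" if "@a" eq "@b";
```

Both subroutines return the matching records without their trailing newlines, which is why the print adds one back. For a real file you would still need to be confident that it fits comfortably in memory before reading it in one go.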
PH: 01144 1225 708225 • FAX: 01144 1225 707126 • EMAIL: info@wellho.net • WEB: http://www.wellho.net • SKYPE: wellho