ESN 60361-090227-520744-31
Document Name: A mistaken grep
Document Description: A mistaken grep

2009/02/27

A recent post at Unix.com was from someone having difficulty with "grep". This happened to be on Mac OS X, but it really could have happened almost anywhere, even on Windows.

The poster was trying to grep a string from a NeoOffice document and of course not getting great results. Apparently he'd gotten wind of grep from a brief mention in Pogue's The Missing Manual. Admittedly, Pogue isn't very clear about things; he says something like "its search material can be part of any file, especially plain text files" (emphasis mine). I'm not here to beat up on Pogue for his poor comprehension of Unix utilities, but the "especially text files" might have been what prompted this person to follow up with this:

(From post at "proper syntax of grep command" - Unix.com)

"I just copied the NeoOffice file (saved as .doc, a word format) to text edit (a .txt file) and the command worked (grep 'I am writing') but it printed the whole letter, or most of it -- which I'm guessing means that it found all the lines with any of the three words in the string and printed those lines. Is it possible to use 'grep' to find this particular sentence fragment and no other lines which don't contain this entire fragment?"

You and I know that COPYING a file to something with a .txt extension changes nothing. It's still a .doc file and always will be. If he or she wanted a text file, they needed to do a "Save As" from NeoOffice and choose a text format to save into. Then and only then is grep likely to return the expected results.

However, is there anything basically wrong with the thought process that happened here? Would it be fair to say "You just aren't getting it - you are thinking of files incorrectly"? True, they ARE not understanding what grep expects, but is that their fault or grep's? After all, there is precedent for programs behaving differently when they have different names.

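
That precedent can be sketched in a few lines of shell. This is only an illustration of the general mechanism (a program checking the name it was invoked under), not how any real "ls" or "lc" is implemented; the script names "show" and "show-upper" are invented for the example.

```shell
#!/bin/sh
# Hypothetical sketch: one script, two names, two behaviors.
# "show" and "show-upper" are made-up names used only for illustration.

workdir=$(mktemp -d)

# The script looks at the name it was started under ($0) and picks a behavior.
cat > "$workdir/show" <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
  show-upper) tr '[:lower:]' '[:upper:]' ;;  # alternate behavior under the second name
  *)          cat ;;                         # default behavior: pass input through
esac
EOF

# A hard link gives the very same file a second name.
ln "$workdir/show" "$workdir/show-upper"

# Same bytes on disk, same input, different name, different result.
plain=$(echo "I am writing" | sh "$workdir/show")
upper=$(echo "I am writing" | sh "$workdir/show-upper")

echo "$plain"
echo "$upper"

rm -rf "$workdir"
```

Run as written, this prints "I am writing" followed by "I AM WRITING". Both names point at identical data; only the name used to reach it differs.
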
On many systems, "ls" and "lc" are the same binary hard linked to two names. Invoked as "lc", the binary acts as though it has been given the "-C" flag (on OS X that's the default for terminal output anyway). If a binary can behave differently based on its name, why can't a file do the same?

Well, hold on there: a binary is a program. A plain file is just a static collection of bytes. It's unreasonable to think that it could present different data - it only HAS one data set! Well, no, that's not necessarily true, especially on OS X. What about resource forks? What about metadata? Would it be entirely unreasonable for a file to expose different parts of its data based on the name used to access it? I'd say no, that's not at all unreasonable.

It's also not unreasonable to think that programs could treat files differently based on names. Why not? Why couldn't "grep" on a .doc file be designed to ferret out paragraphs while reverting to "normal" behavior for text files? Of course it could. It could do OCR on image files or strip out certain colored pixels - I can imagine all sorts of useful things grep could do for many different files, and I can certainly see it being designed to act differently on identical data presented under a different name. Of course it doesn't work that way, but it COULD. The fact that I can't think of any program that does treat data differently based on name is not important either: such programs could be written and the paradigm could actually be useful.

So, our would-be grep user needs to learn a few things about what Unix utilities actually expect. That's fine. But maybe we can learn a few things too. Maybe a little mind shift on our part might actually turn into something useful.

Author: Anthony Lawrence
Publisher: Anthony Lawrence
Licensee Name: Anthony Lawrence
Reference URL: http://aplawrence.com/MacOSX/mistaken_grep.html
Copyright: All Rights Reserved
Registration Date: 2/27/2009 10:33:03 PM UTC
