released, and upcoming faster deblob-check script

Alexandre Oliva lxoliva at fsfla.org
Tue Jan 26 02:31:25 UTC 2010

I think I forgot to announce before that.  And, just in
case, and are also available.

Please let me know what you think of the new GNU/Linux-libre banner at
the top of the http://linux-libre.fsfla.org web page.  I've used it as
the bootsplash logo in the latest builds for gNewSense/mipsel, and will
use it in upcoming Freed-ora builds as well.

On another piece of news, I have a brand new implementations of
deblob-check that is much faster.  In my first experiments with it, the
day was released, I managed to complete two deblobbings
sessions using two slightly modified versions of it, with some time to
spare, before the old deblob-check, used for the official release,

Some minor differences remained, and I still had some planned changes,
so I figured I'd stick to the old script a while longer, but if you'd
like to try the new one, you can get it from the svn repository:

Making it faster took some significant reengineering.  I first tried to
rewrite the largest and slowest sed script in python.  It wasn't too
hard because I had been preparing the 1000+ regular expressions that
match blobs and false positives for mechanical conversion.

The python script was small and nice, its start-up time and memory
footprint was ridiculously small compared with that of sed, but it took
forever to process some specific files even in Linux-libre.  Oh, the
wonders of PCRE-derived “regular expression” engines...

So I turned to awk, and the script was not as beautiful, but results
were still pretty good.  The smaller start-up time of GNU awk (3s vs
12s) makes for a lot of advantage, but that's not all: although sed is
still some 20% faster in checking that the Linux-libre tarball is clean,
awk is more than twice as fast in printing all the blobs in the Linux
tarball.  An earlier revision of the awk script, that didn't rely on a
GNU awk RS extension to avoid lots of string concatenation, still
consumed about 2GB of RAM to check the Linux-libre tarball, but a
revision that uses this extension kept memory consumption limited to
some 150MB.  Yay!  deblob-check is again usable on limited hardware!

But that's not all.  Still hopeful that perl's regular expressions
engine could do some magic where python failed, I rewrote the script in
perl.  No luck, it still performed badly on some relatively simple
files.  Turns out the exponential behavior exhibited by backtracking
gets really slow really fast.

I studied a bit of how these engines worked to try to learn how to avoid
the hotspots.  I managed to rework some of the most common patterns to
avoid very slow behavior even for such small inputs as the ones in the
built-in testsuite and the CREDITS file in Linux, but that was not
enough to get things to work fast in the common use cases, namely, to
deblob files containing blobs (as directed by deblob-main), or to check
that Linux-libre is clean.  deblob-check -C got sort of usable with some
inputs, even with python (without any difference for the NDA-based
O(m+n) engines used by gawk and sed), but there are still too many
patterns that match false positives that are going to need rewriting to
behave better with PCRE-based engines.  If you have some familiarity
with regexps and you'd like to help, that would be lovely.  perl has the
fastest regexp compilation time and the lowest memory footprint, but
that will only help when we avoid its taking almost 2 hours to go
through Linux-libre, reporting a few times that some internal limit had
been exceeded, and then falsely claiming it's not clean, and then,
running overnight to list the blobs in Linux and not finishing in almost
12 hours, when the awk-based version can complete both tasks in 3 to 5
minutes (vs sed taking 2.5 and 10 minutes).

Anyway, I still have some ideas on how to simplify further the logic in
the scripts in all these languages, even in sed, in which I was
surprised to find out a start-up time of 1.2 seconds for a script
containing only one match-all regular expression (like the perl and
python scripts), vs 3 seconds start-up time for awk, fractions of a
second for python and perl, and 11+ seconds for the current sed script.
This, and the fact that sed performs better than even awk at checking
Linux-libre, makes me think that using in sed the same general algorithm
I used in the other scripts might offer major speedups.  I'm going to
try that next.  In the mean time, you can play with all these variants
with the deblob-check script from the SVN repository, and the
newly-introduced --use-awk, --use-sed, --use-perl and --use-python.

Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

More information about the linux-libre mailing list