Ad Blocking With Regular Expressions HOWTO
As so many people are having problems with the fact that OmniWeb (OW) uses regular expressions instead of glob patterns (like "*.dmg") for ad blocking, I decided to write a short HOWTO on using regular expressions. So any questions about that can now be redirected to this thread. ;-)

Translating glob patterns into regular expressions

Glob patterns are the stuff most people are used to, like:
  • "*" will match any char, 0 or more times — "old*" will match "old times", "old news", or "old"; "*new*" will match "something new", "news", or "new".
  • "?" will match any single char — "test.?mg" will match "test.img", "test.dmg", or "test.omg", but not "test.mg"; "test.???" will match "test.txt", "test.htm", or "test.bak", but not "test.html" or "test.js". (Many people don't know this one.)

To translate these into regular expressions, do the following:
  • Replace "*" by ".*" — "old*" -> "old.*"
  • Replace "?" by "." — "n?w" -> "n.w"
  • Escape special characters - Several characters have a special meaning inside a regular expression. If you want the literal character instead of that meaning, you have to precede it with a backslash (\). In URLs, the special characters you will meet most often are ".", "+" and "?" (others, such as "*", "$", "(" and ")", are also allowed in URLs but are rare). So "some.host.com" becomes "some\.host\.com". You'd also have to escape the "?" and the "+" if you really wanted to match form data, which doesn't really make sense in this context; but nevertheless: "some.host.com\?search=foo+bar" becomes "some\.host\.com\?search=foo\+bar". (As "?" has a special meaning in glob patterns, too, it has to be escaped in the glob pattern as well.)
  • You don't have to match the whole string - Unlike glob patterns, regular expressions match anywhere within the string. For example, if you wanted to match anything from some.host.com, in a glob pattern you'd use "http://some.host.com/*". For regular expressions, though, it is enough if your pattern occurs somewhere within the string. Thus, "some\.host\.com" is enough. Similarly, the glob pattern "*ad*" has the same effect as the regular expression "ad". See below if you want to match an exact string.
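The translation rules above can be sketched in a few lines of Python. This is only an illustration: the function name glob_to_regex is made up, Python's re module stands in for OmniWeb's regular expression engine (the two should agree on the basic syntax used here, but that is an assumption), and backslash-escaped characters inside the glob pattern are not handled.

```python
import re

def glob_to_regex(glob: str) -> str:
    """Translate a glob pattern into an unanchored regular expression.

    Hypothetical helper for illustration; it does not handle
    backslash escapes inside the glob pattern itself.
    """
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")           # "*" -> ".*"
        elif ch == "?":
            parts.append(".")            # "?" -> "."
        else:
            parts.append(re.escape(ch))  # escape specials, e.g. "." -> "\."
    return "".join(parts)

for glob in ("old*", "n?w", "some.host.com"):
    print(glob, "->", glob_to_regex(glob))

# re.search looks for the pattern anywhere in the string,
# which is how the ad blocking patterns are described above.
print(bool(re.search(glob_to_regex("some.host.com"),
                     "http://some.host.com/banner.gif")))
```

Because the result is not anchored, it behaves like the glob pattern with a "*" on both ends.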

Some examples:

"http://*.falkag.*" -> "http://.*\.falkag\..*" — or even simply "\.falkag\."
"http://some.ad.com/*" -> "some\.ad\.com"
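"*.swf" -> ".*\.swf" (strictly speaking, this also matches ".swf" in the middle of a URL; to match only at the end, as the glob pattern does, use "\.swf$", which is explained below)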
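If you want to check such translations yourself, here is a small sketch. The URLs are made up for the illustration, the helper name blocked_by is hypothetical, and Python's re.search is assumed to behave like OmniWeb's "match anywhere" rule for these simple patterns.

```python
import re

# Hypothetical URLs, made up for this illustration.
URLS = [
    "http://ads.falkag.net/banner.gif",
    "http://some.ad.com/img/1.gif",
    "http://example.com/movie.swf",
    "http://example.com/index.html",
]

def blocked_by(pattern, urls=URLS):
    """Return the URLs in which the pattern is found anywhere."""
    return [u for u in urls if re.search(pattern, u)]

# The long and the short falkag pattern pick out the same URL.
for pattern in (r"http://.*\.falkag\..*", r"\.falkag\.",
                r"some\.ad\.com", r".*\.swf"):
    print(pattern, "->", blocked_by(pattern))
```

Each pattern should pick out exactly one of the four URLs, and the index.html URL is never blocked.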

Ok, you're done. With the knowledge from above, you can translate any glob pattern into a regular expression. If that's enough for you, don't read on. :-)

Now we will see what else can be done with regular expressions, and how regular expressions work. From this point on, it will get rather complicated. So be warned. :-)

The Power of Regular Expressions

I won't explain everything; regular expressions are even more powerful than what I will show you here, but I will try to explain how they work and show you the most useful details for ad blocking.
  • How to match an exact string and the end of a string - As said above, a regular expression pattern will match anywhere within the string. But what if you want to block only a single image from a server? Or if you want to match only if your pattern is at the end of the string? For these cases, use "^" and "$". "^" means "the beginning of the string", "$" means "the end of the string". If you wanted to block "some.host.com/ad.gif", you could use "^http://some\.host\.com/ad\.gif$". If you want to block flash, you'll need to block all URLs that end with ".swf", so "\.swf$" is the pattern of choice.
  • A char not followed by a special character means: This char, exactly once. — This might seem obvious: An "a" means an "a". But the key term is "not followed by a special character". Those special characters can change everything—see below.
  • "." means: Any char, exactly once. - "..." will match "aaa", "abc", "xyz", "   " (three spaces) and so on. But remember: It will match anywhere within the string, so the only strings it won't match are those shorter than three characters! If you wanted exactly three characters, you could use "^...$".
  • "*" means: The char before, 0 or more times. - "a*" will match "a", "aa", "aaaaaa", and even "" (zero times, too!). Because zero times is allowed and the match can be anywhere inside the string, "a*" on its own even matches "b", so "*" is only useful next to other characters. So "abc*" will match "This string contains abc.", "The chars ab are part of this string.", or "abccccccccc doesn't make much sense." This is often used with ".": ".*" means "any char, 0 or more times".
  • "+" means: The char before, 1 or more times. — Same as above, but the char before has to be there at least once.
  • "?" means: The char before, 0 or 1 time. — "hell?o" will match "hello" and "helo" anywhere inside the string. "htm.?$" will match "htm", "html", "htmx", or "htmu" at the end of the string.
  • Character groups - Now what if you want to match ... say "flash" and "flesh", but nothing else, like "flosh", "flush", or "flxsh"? Regular expressions give you character groups for that. Just put the chars you look for into square brackets. "[abc]" means: An "a", a "b" or a "c", exactly once. In this example, you could use "fl[ae]sh" as your pattern. Of course, this can be combined with all the special characters. But beware: "[abc]*" doesn't only match "aaaaaaaa", "bbbb", and "c", but also "acbabacabacab". And always remember: We match anywhere within the string. "[abc]*" will thus match "kjngkjbfgkjnadkfjn", as there is an "a" in the string. Furthermore, it will even match "kjngkjngtkjrt", where there is no [abc], because "*" says "0 or more times"! "[abc]" or "[abc]+" won't match that string, though, obviously.
  • "Negative" character groups - Often, it's more convenient to say which chars you don't want rather than which you want. If the first character after the opening bracket is a "^", it means "not". So "[^ab]" means "one char, but not 'a' or 'b'". I use this in my ad blocking settings: I have a line that catches at least 70% of the ads shown: It is "/ad[^dm]". Before you read on: Can you tell what it does? Yes, it blocks any URL that has "/ad" somewhere inside, followed by any character other than "d" or "m". We don't want to block "/additional" stuff or "/admin" pages. (Note that some character has to follow, so a URL that ends in "/ad" is not blocked.)
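The points from the list above can be tried out with a short script. It is a sketch only: Python's re module is assumed to agree with OmniWeb on this basic syntax, the helper name matches is hypothetical, and the example.com URLs are made up.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern is found anywhere within the text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result)
CHECKS = [
    (r"\.swf$", "http://example.com/movie.swf", True),       # "$": end of string
    (r"\.swf$", "http://example.com/movie.swf.html", False),
    (r"^...$", "abc", True),                                  # exactly three chars
    (r"^...$", "abcd", False),
    (r"abc*", "The chars ab are part of this string.", True), # "c" zero times
    (r"hell?o", "helo", True),                                # second "l" optional
    (r"hell?o", "helllo", False),
    (r"fl[ae]sh", "flesh", True),                             # character group
    (r"fl[ae]sh", "flush", False),
    (r"[abc]*", "kjngkjngtkjrt", True),                       # zero times is enough
    (r"[abc]+", "kjngkjngtkjrt", False),
    (r"/ad[^dm]", "http://example.com/ads/banner.gif", True), # negative group
    (r"/ad[^dm]", "http://example.com/admin/login", False),
    (r"/ad[^dm]", "http://example.com/additional/info", False),
]

for pattern, text, expected in CHECKS:
    result = matches(pattern, text)
    assert result == expected, (pattern, text)
    print(f"{pattern!r:12} {text!r:45} {result}")
```

Changing one of the patterns and re-running the script is a quick way to see which URLs a new blocking rule would catch before you add it to your settings.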

So.

This is a first step into regular expressions. Yes, they are complicated, but they are also extremely powerful. There are more options than what I described here, but I think this is by far enough for ad blocking. :-)

I hope you can use these explanations.

Last edited by zottel; 2006-11-03 at 04:35 PM.
 
 

