Re: Url regex keeps django busy/crashing
On Thu, Jul 26, 2012 at 10:45 PM, Joe <admin@gamebee.de> wrote:
> Hey, I have a url regex like this which is keeping django extremely busy
> (20secs to 1min to handle a request). On some urls it even crashes.
>
> my regex:
>
> url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),
>
>
> view:
>
> def detail(request, item_url):
>     i = get_object_or_404(Page, url=item_url, published=True)
>     return render_to_response('item/detail.html', {'item': i},
>                               context_instance=RequestContext(request))
>
> replaced with:
>
> url(r'^(?P<item_url>[\w-]+)/$', 'detail'),
>
>
> The replacement works like a charm. What is wrong with the first regex?
Hi Joe,
There's nothing strictly *wrong* with the first regex -- it just
describes a very complex lookup strategy, and as a result, it takes
extra time to compute.
In the second regex, you're asking for "a string of 1 or more
characters that are either word-like or '-'". That's a very easy thing
to check - if you think of how you would manually implement code that
checks that policy, it could be done with a simple if inside a while
loop; as soon as you find a character that doesn't match, you can bail
out.
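To make that concrete, here is a rough sketch of the "if inside a while loop" version. The function name match_slug is made up for illustration, and treating \w as "alphanumeric or underscore" is only an approximation of what the regex engine accepts:

```python
def match_slug(path):
    # Hand-written approximation of ^[\w-]+/$ (hypothetical helper).
    # Assumption: "word-like" means alphanumeric or underscore.
    i = 0
    while i < len(path):
        ch = path[i]
        if not (ch.isalnum() or ch == "_" or ch == "-"):
            break  # first character that doesn't fit: bail out
        i += 1
    # Need at least one accepted character, then exactly "/" and nothing else.
    if i == 0 or path[i:] != "/":
        return None
    return path[:i]


print(match_slug("some-page_1/"))
print(match_slug("some.page/"))
```

Each character is looked at once, so the cost grows in step with the length of the URL and no faster.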
However, the first regex is asking for "0 or more groups of word-like
characters, each of which might be followed by a '-'". Consider a
trivial case, matching against the string abcde. It can match the
first regex in an incredible number of ways:
(a)(b)(c)(d)(e)
(ab)(c)(d)(e)
(abc)(d)(e)
(abcd)(e)
(abcde)
(a)(bc)(d)(e)
(a)(bcd)(e)
(a)(bcde)
(a)(b)(cde)
…
and so on. The regex engine tries these groupings one at a time,
backtracking to the next one whenever the rest of the pattern fails to
match. When a URL doesn't match at all, it has to work through
essentially every single one of them before it can give up, and the
number of groupings doubles with each extra character. As you can
guess, this can take some time, which you're observing as a 1 minute
delay in serving a URL.
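To get a feel for how quickly the number of groupings grows, here is a small sketch that simply enumerates them. The groupings helper is made up for illustration; it is not what the regex engine runs internally, it only counts the candidates listed above:

```python
def groupings(s):
    # Yield every way to cut s into contiguous, non-empty pieces.
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        for rest in groupings(s[i:]):
            yield [head] + rest


# The count doubles with every extra character: 2 ** (n - 1) for n characters.
for n in range(1, 11):
    count = sum(1 for _ in groupings("a" * n))
    print(n, count)
```

For abcde that is 16 groupings; a 30 character slug already has over 500 million.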
This is one of the gotchas that comes from using regular expressions.
They're a very powerful language for expressing constraints, but you
need to be careful that you don't accidentally fall into a trap where
you're asking for something very complex.
And don't worry - you're in good company being bitten by this problem.
There was a Django security release caused *specifically* by a regular
expression like yours. Django uses regular expressions to validate
URLs and email form inputs, and at one point, the regex that was used
to validate email addresses was constructed in such a way that it was
possible to provide a very simple string that would cause the
validator to take 30 seconds to confirm that it wasn't valid. Write a
tool that hits the same URL and validates the same string 100 times,
and you've got yourself a denial-of-service attack.
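I won't reproduce the old email regex here, but the same effect can be seen with the two URL patterns from this thread. This is only a sketch: time_failed_match is a hypothetical helper, the test string (a run of "a" followed by "!") is an arbitrary non-matching input, and the actual timings will depend on your machine:

```python
import re
import time

# The two patterns from the original post.
BAD = re.compile(r'^(?P<item_url>(\w+-?)*)/$')
GOOD = re.compile(r'^(?P<item_url>[\w-]+)/$')


def time_failed_match(pattern, n):
    # n word characters followed by a character neither pattern accepts,
    # so the match is guaranteed to fail.
    path = "a" * n + "!"
    start = time.perf_counter()
    result = pattern.match(path)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Keep n small: the nested pattern roughly doubles its work per character.
for n in (14, 16, 18, 20):
    _, bad_time = time_failed_match(BAD, n)
    _, good_time = time_failed_match(GOOD, n)
    print("n=%d  nested: %.4fs  simple: %.6fs" % (n, bad_time, good_time))
```

Both patterns agree on the answer (no match); the difference is only in how long the nested one takes to reach it.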
So - when you're building your URL patterns, you should be trying to
keep your regular expressions as simple as possible -- i.e., simple
linear probes. If you really do need to match a complex pattern, you'd
be better served using a simple regex in the URL pattern, and then
doing more specific validation in the view (and raising 404 if the
pattern doesn't match what you need it to).
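As a sketch of that last suggestion: keep the simple [\w-]+ pattern in the URLconf and do the stricter check in plain Python. The helper name is_valid_item_url is hypothetical, and the rule it enforces (word-like chunks separated by single dashes, no leading or trailing dash) is my assumption about what the original pattern was aiming for:

```python
def is_valid_item_url(item_url):
    # Hypothetical helper. Assumed rule: non-empty word-like chunks
    # separated by single dashes, e.g. "my-first-page".
    chunks = item_url.split("-")
    return all(
        chunk and all(c.isalnum() or c == "_" for c in chunk)
        for chunk in chunks
    )


# In the view this would be used along these lines (Django names as in
# the original post, with Http404 coming from django.http):
#
#     def detail(request, item_url):
#         if not is_valid_item_url(item_url):
#             raise Http404
#         ...

print(is_valid_item_url("my-first-page"))
print(is_valid_item_url("my--first-page"))
```

The check is a single pass over the string, so a long or malformed URL costs no more than a well-formed one.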
Yours,
Russ Magee %-)
--
You received this message because you are subscribed to the Google Groups "Django users" group.