Tuesday, July 3, 2012

Re: Use regular expression to retrieve all image tags from a given content

On 3-7-2012 20:38, Tim Chase wrote:
> On 07/03/12 12:57, Melvyn Sopacua wrote:
>> On 30-6-2012 15:23, Sunny Nanda wrote:
>> What you're looking for is:
>> import re
>> prog = re.compile(r'<img.*?/>')
>> # content is the string to search; findall returns every matching tag
>> matches = prog.findall(content)
>> for match in matches:
>>     print(match)
>>
>>> On a sidenote, you should not be using regular expressions if you are doing
>>> anything more complex than what you are doing right now.
>>
>> This isn't complex. The email validator in django is complex. Using an
>> XML parser for this is quite overkill. If you need several elements
>> based on their nesting and/or sister elements, then an XML parser makes
>> more sense, or better xpath queries. This is simple stuff for regular
>> expressions and what they're made for.
>
> The reason for using a true parser is to avoid obscure edge cases.
> Your example fails on both
>
> <IMG ... >

Which is easily corrected with either <[Ii][Mm][Gg] or a case-insensitive match.
>
> and
>
> < img ... >

Which should fail.
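
A minimal sketch of the case-insensitive variant, assuming Python's re
module. The name IMG_TAG and the sample string are made up for
illustration; the pattern is loosened to [^>]* so that tags without a
closing slash match as well.

```python
import re

# re.IGNORECASE does the same job as spelling out [Ii][Mm][Gg].
IMG_TAG = re.compile(r'<img[^>]*>', re.IGNORECASE)

# Hypothetical input: an upper-case tag, a self-closing tag, and the
# malformed "< img" form, which should not be treated as a tag.
sample = '<IMG src="a.gif"> <img src="b.gif"/> < img src="c.gif">'

tags = IMG_TAG.findall(sample)
for tag in tags:
    print(tag)
```

The "< img" form is skipped because the pattern requires "img" directly
after the opening bracket.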

> Also, depending on the use-case (such as stripping them out of
> validated code), a use-case such as
>
> <i<img>mg src="evil.gif">
>
> could get part stripped out and leave the evil <img> tag in the text.

r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
cases. The point is that if you want nothing but the tags (stripped or
matched), regular expressions can do the job just fine. It's actually
more complex to do this with parsers: you have to deal with syntax
errors, and with SAX-based parsers you have to keep state and rejoin
the tags with their attributes. The only parser with a real advantage
is a DOM tree, which has a large memory footprint on complex/large
documents.
It's a trade-off you should make a decision on, rather than blatantly
dismissing regular expressions whenever a document contains tags, or
calling them complex when they contain more than two characters. The
call can even be swayed in favor of either by the "I want to learn
(regex|XML parsing)" argument.
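
As a rough sketch of how the pattern above could be used for both
matching and stripping, assuming Python's re module: the function names
and sample strings are hypothetical, and repeating the substitution
until the text stops changing is my own assumption about one way to
handle nested input, not something prescribed in this thread.

```python
import re

# The pattern from the paragraph above, unchanged.
IMG_WITH_SRC = re.compile(r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>')


def find_img_tags(html):
    """Return every img tag that carries a src attribute."""
    return IMG_WITH_SRC.findall(html)


def strip_img_tags(html):
    """Remove matching img tags, repeating until nothing changes.

    A single pass can leave a new tag behind when one tag is nested
    inside the text of another, so the substitution is rerun until the
    result is stable.
    """
    previous = None
    while previous != html:
        previous = html
        html = IMG_WITH_SRC.sub('', html)
    return html


sample = '<p><IMG alt="x" SRC="a.gif"><img src="b.gif"/><img></p>'
found = find_img_tags(sample)
stripped = strip_img_tags('<i<img src="x.gif">mg src="evil.gif">')
```

Note that a bare <img> without a src attribute is not matched at all by
this pattern, so it is neither reported nor stripped.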
--
Melvyn Sopacua


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to django-users+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.
