Saturday, January 23, 2010

comp.lang.python - 13 new messages in 4 topics - digest

comp.lang.python
http://groups.google.com/group/comp.lang.python?hl=en

comp.lang.python@googlegroups.com

Today's topics:

* list.pop(0) vs. collections.dequeue - 7 messages, 4 authors
http://groups.google.com/group/comp.lang.python/t/9221d87f93748b3f?hl=en
* medians for degree measurements - 4 messages, 3 authors
http://groups.google.com/group/comp.lang.python/t/ea0ad2a689de707c?hl=en
* Problems with the date of modification of files on the flash drive in
windows - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/dd1a415db7e08c1c?hl=en
* Using dictionary key as a regular expression class - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/6ee5d2027516c648?hl=en

==============================================================================
TOPIC: list.pop(0) vs. collections.dequeue
http://groups.google.com/group/comp.lang.python/t/9221d87f93748b3f?hl=en
==============================================================================

== 1 of 7 ==
Date: Fri, Jan 22 2010 9:42 pm
From: Steve Howell


On Jan 22, 6:20 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
> On Fri, 22 Jan 2010 14:38:18 -0800, Steve Howell wrote:
> > I know the Python programmer could simply iterate through the list
> > rather than popping off unused elements, but that just means that you
> > not only tie up the memory for the pointers just as long, but you also
> > prevent the objects themselves from being garbage collected.
>
> > In my case I'm not really concerned about giving the memory back right
> > away, it's more about keeping my code simple.  Once I'm done with an
> > element, I do just want to pop it off and keep all the simplicity of the
> > lists, but I am just concerned enough speed that for a 1000 element
> > list, I don't want to be doing 1000 memmoves for an average of 500
> > pointers, which effectively moves about a million bytes around for no
> > reason.
>
> > I suppose the solution is to either give up the sugar of lists, or try
> > to wrap something like deque or list-with-offset into a sequence.
>
> I don't understand what the actual problem you're trying to solve is.
> Despite your disclaimer about not being concerned about reclaiming the
> memory, it sounds like you're trying to micro-optimize memory usage.

My discussion of memory probably distracted you from the fact that I'm
not proposing a micro-optimization of memory; I am proposing a mega-
optimization of CPU.

This innocent program here literally moves about a million bytes of
memory around for no good reason:

lst = []
for i in range(2000):
lst.append(i)
while lst:
print lst.pop(0)

Why? Because list.pop(0) is implemented in O(N) instead of O(1).

Why wouldn't you get a competent C programmer simply make
list_ass_slice smart enough to make list.pop(0) O(1)?

The brilliant computer scientist, Christian Heimes, provides the
answers, and I am paraphrasing here, of course:

1) You can save 64 bits for every list by not storing an extra
pointer!
2) You can simplify the CPython implementation!
3) You can avoid the oh-so-expensive check "if ilow == 1" for all
operations that don't need this optimization!

Sounds like two micro-optimizations to me (and a copout to boot).

> That
> is, you give me the impression that you prefer this:
>
> while alist:
>     x = alist.pop(0)
>     process(x)
>
> over this:
>
> for x in alist:
>     process(x)
> # allow alist to be garbage collected when it goes out of scope
>

No, to be more precise, I prefer this implementation of a recursive
parser (using lists) to one that would have to use deque's because of
the lameness of Python's list implementation:

https://bitbucket.org/showell/shpaml_website/src/tip/shpaml.py


== 2 of 7 ==
Date: Fri, Jan 22 2010 9:58 pm
From: Steve Howell


On Jan 22, 7:09 pm, Roy Smith <r...@panix.com> wrote:
> In article
> <3ac173bd-4124-434d-b726-0b9baaeec...@36g2000yqu.googlegroups.com>,
>  Steve Howell <showel...@yahoo.com> wrote:
>
> > In my case I'm not really concerned about giving the memory back right
> > away, it's more about keeping my code simple.  Once I'm done with an
> > element, I do just want to pop it off and keep all the simplicity of
> > the lists, but I am just concerned enough speed that for a 1000
> > element list, I don't want to be doing 1000 memmoves for an average of
> > 500 pointers, which effectively moves about a million bytes around for
> > no reason.
>
> If you really want to pop all the elements from a long list, reverse the
> list and pop them off the end.  Popping every element off the beginning of
> the list is O(n^2), as you pointed out.  Reversing the list is O(n), and
> each pop after that is O(1), so the overall complexity is O(n).

I really want to use list *normally* with all its perfectly good
semantics and reasonable implementation, except for its blind spot
with respect to popping the first element off the list. The whole
reason I use CPython vs. C in the first place is that CPython
programmers can generally program basic data structures better than I
can. But list.pop(0) is the exception. And, with the possible
exception of dicts, lists are the most fundamental data structures
that Python has.

I know Python's number one concern will never be speed, but if Python
makes an O(1) operation into an unnecessarily O(N) operation for no
good reasons other than "it's too complicated, " or it "adds another
pointer to the structure," or "it adds another conditional check to
list_ass_slice for operations that aren't popping off the top," I
think it's reasonable to challenge the design philosophy.


== 3 of 7 ==
Date: Fri, Jan 22 2010 11:10 pm
From: aahz@pythoncraft.com (Aahz)


In article <83082e19-9130-45a8-91f2-8601c1fdaac3@22g2000yqr.googlegroups.com>,
Steve Howell <showell30@yahoo.com> wrote:
>
>I really want to use list *normally* with all its perfectly good
>semantics and reasonable implementation, except for its blind spot
>with respect to popping the first element off the list. The whole
>reason I use CPython vs. C in the first place is that CPython
>programmers can generally program basic data structures better than I
>can. But list.pop(0) is the exception. And, with the possible
>exception of dicts, lists are the most fundamental data structures
>that Python has.
>
>I know Python's number one concern will never be speed, but if Python
>makes an O(1) operation into an unnecessarily O(N) operation for no
>good reasons other than "it's too complicated, " or it "adds another
>pointer to the structure," or "it adds another conditional check to
>list_ass_slice for operations that aren't popping off the top," I
>think it's reasonable to challenge the design philosophy.

"Rough consensus and running code."

You have a good point, but nobody will ever give your idea serious
attention until there's a patch and benchmarks.
--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

import antigravity


== 4 of 7 ==
Date: Fri, Jan 22 2010 11:43 pm
From: "Alf P. Steinbach"


* Steve Howell:
> On Jan 22, 7:09 pm, Roy Smith <r...@panix.com> wrote:
>> In article
>> <3ac173bd-4124-434d-b726-0b9baaeec...@36g2000yqu.googlegroups.com>,
>> Steve Howell <showel...@yahoo.com> wrote:
>>
>>> In my case I'm not really concerned about giving the memory back right
>>> away, it's more about keeping my code simple. Once I'm done with an
>>> element, I do just want to pop it off and keep all the simplicity of
>>> the lists, but I am just concerned enough speed that for a 1000
>>> element list, I don't want to be doing 1000 memmoves for an average of
>>> 500 pointers, which effectively moves about a million bytes around for
>>> no reason.
>> If you really want to pop all the elements from a long list, reverse the
>> list and pop them off the end. Popping every element off the beginning of
>> the list is O(n^2), as you pointed out. Reversing the list is O(n), and
>> each pop after that is O(1), so the overall complexity is O(n).
>
> I really want to use list *normally* with all its perfectly good
> semantics and reasonable implementation, except for its blind spot
> with respect to popping the first element off the list. The whole
> reason I use CPython vs. C in the first place is that CPython
> programmers can generally program basic data structures better than I
> can. But list.pop(0) is the exception. And, with the possible
> exception of dicts, lists are the most fundamental data structures
> that Python has.

Having optimized list.pop() for first element, then you would have a blind spot
with respect to insertion at the front... Which could then be optimized for the
cases where there is some buffer space at the front, left over from a pop. I
think it would just be harder to understand and harder to explain. And I think
that due to that, as usual people would build their own elaborate
"explanations", with erroneous conclusions, and would then use it in inefficient
ways (like, popping from the middle or inserting at the front).

I think the "harder to use correctly" is the strongest argument against the
optimization, but there is also a non-obvious *memory overhead*...

Having popped just one element at the front, in order for the implementation to
not /accumulate/ unused memory, that is, in order to avoid an ongoing /memory
leak/, extending the buffer to accomodate e.g. an append() can no longer be done
as a (on relevant systems) constant time reallocation (letting the OS do its
virtual memory paging magic). Instead, a new buffer would have to be allocated
and all data copied over. One could still have amortized constant time for
appends by using a doubling buffer (which is the strategy used in C++
'std::vector'), but at the cost of wasting some memory, a percentage...

The implementation as a pure array is easy to understand and use correctly, and
doesn't waste memory.

But it would IMHO have been better if it wasn't called "list", which brings in
the wrong associations for someone used to other languages. The name does
matter. E.g. I don't think this discussion about a pop optimization would have
been there except for the name, which makes that optimization sound reasonable.
Perhaps some standard alternative name could be devised. Like, "array" would
have been nice, except that that's already grabbed by the array module.


> I know Python's number one concern will never be speed, but if Python
> makes an O(1) operation into an unnecessarily O(N) operation for no
> good reasons other than "it's too complicated, " or it "adds another
> pointer to the structure," or "it adds another conditional check to
> list_ass_slice for operations that aren't popping off the top," I
> think it's reasonable to challenge the design philosophy.

See above. In addition to being more difficult /to use/ correctly, that is,
being much easier to misunderstand, it incurs a memory overhead -- not the one
directly from the pop optimization, but by the requirements of buffer extension.
Essentially, as discussed above, it would then have to use a doubling buffer.


Cheers & hth.,

- Alf


== 5 of 7 ==
Date: Fri, Jan 22 2010 11:43 pm
From: Steve Howell


On Jan 22, 11:10 pm, a...@pythoncraft.com (Aahz) wrote:
>
> >I know Python's number one concern will never be speed, but if Python
> >makes an O(1) operation into an unnecessarily O(N) operation for no
> >good reasons other than "it's too complicated, " or it "adds another
> >pointer to the structure," or "it adds another conditional check to
> >list_ass_slice for operations that aren't popping off the top," I
> >think it's reasonable to challenge the design philosophy.
>
> "Rough consensus and running code."
>
> You have a good point, but nobody will ever give your idea serious
> attention until there's a patch and benchmarks.

Here is a benchmark of O(N*N) vs. O(N) for two C programs. One does
memmove; the other simply advances the pointer.

showell@showell-laptop:~$ time ./slow

real 0m1.446s
user 0m1.444s
sys 0m0.004s
showell@showell-laptop:~$ time ./fast

real 0m0.003s
user 0m0.004s
sys 0m0.000s
showell@showell-laptop:~$ diff slow.c fast.c
23,24c23
< lst.size -= 1;
< memmove(lst.p, lst.p + 1, lst.size);
---
> lst.p += 1;
showell@showell-laptop:~$ cat slow.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

struct ob_item {
int whatever;
};

struct list {
struct ob_item *p;
int size;
};

struct list make_list(int n)
{
struct list lst;
lst.p = malloc(n);
lst.size = n;
return lst;
}

void list_pop_left(struct list lst) {
lst.size -= 1;
memmove(lst.p, lst.p + 1, lst.size);
}

int main() {
struct list lst;
int i;
int n = 40000;
int t;

lst = make_list(n);
for (i = 0; i < n; ++i) {
list_pop_left(lst);
}
}

== 6 of 7 ==
Date: Fri, Jan 22 2010 11:58 pm
From: Steve Howell


On Jan 22, 11:10 pm, a...@pythoncraft.com (Aahz) wrote:
> In article <83082e19-9130-45a8-91f2-8601c1fda...@22g2000yqr.googlegroups.com>,
> Steve Howell  <showel...@yahoo.com> wrote:
>
>
>
>
>
> >I really want to use list *normally* with all its perfectly good
> >semantics and reasonable implementation, except for its blind spot
> >with respect to popping the first element off the list.  The whole
> >reason I use CPython vs. C in the first place is that CPython
> >programmers can generally program basic data structures better than I
> >can.  But list.pop(0) is the exception.  And, with the possible
> >exception of dicts, lists are the most fundamental data structures
> >that Python has.
>
> >I know Python's number one concern will never be speed, but if Python
> >makes an O(1) operation into an unnecessarily O(N) operation for no
> >good reasons other than "it's too complicated, " or it "adds another
> >pointer to the structure," or "it adds another conditional check to
> >list_ass_slice for operations that aren't popping off the top," I
> >think it's reasonable to challenge the design philosophy.
>
> "Rough consensus and running code."
>
> You have a good point, but nobody will ever give your idea serious
> attention until there's a patch and benchmarks.

Another benchmark is that deques are slower than lists for accessing
elements.

showell@showell-laptop:~$ python foo.py
0.0215361118317 <- list
0.0429010391235 <- deque


import time
from collections import deque

n = 40000
lst = []
for i in range(n):
lst.append(i)

t = time.time()
for i in range(n):
lst[i]
print time.time() - t

lst = deque(lst)
t = time.time()
for i in range(n):
lst[i]
print time.time() - t

So substituting deque for list suffers not just in convenience, but
also in performance.

== 7 of 7 ==
Date: Sat, Jan 23 2010 12:13 am
From: Terry Reedy


On 1/23/2010 12:58 AM, Steve Howell wrote:

> I really want to use list *normally* with all its perfectly good
> semantics and reasonable implementation, except for its blind spot
> with respect to popping the first element off the list.

It was not designed for that. .pop() was added to lists about 10 years
ago because I asked for it (with no parameter, pop off end only) and
wrote what would now be a PEP -- and because Tim Peters later supported
the idea. Adding the optional parameter was something of an afterthought
(never publicly discussed as far as I know) for occasional use for
'short' lists where O(n) is tolerable. You have half persuaded me that
that the parameter addition was a mistake. Perhaps is is too attractice
a nuisance for some people ;=).

> The whole
> reason I use CPython vs. C in the first place is that CPython
> programmers can generally program basic data structures better than I
> can.

They have given us three options other that .pop(0).

1. listiterator
2. queue.Queue
3. collections.deque\

Why are you so stubborn about not using a data structure intended for
your use case instead of one that was not?

There is also
4. a two-list design for queues: iterator through one (or pop() from a
reversed version thereof) while appending to another; switch when the
first is empty. I plan to write it up with tests some day this year.

> I know Python's number one concern will never be speed, but if Python
> makes an O(1) operation into an unnecessarily O(N) operation for no
> good reasons other than "it's too complicated, " or it "adds another
> pointer to the structure," or "it adds another conditional check to
> list_ass_slice for operations that aren't popping off the top," I
> think it's reasonable to challenge the design philosophy.

Challenge yes, mock no.

Part of writing good basic data structures is not adding needless
complication from featuritis and not penalizing 99.99% of access to
satify a .01% need better satisfied another way.

Terry Jan Reedy

==============================================================================
TOPIC: medians for degree measurements
http://groups.google.com/group/comp.lang.python/t/ea0ad2a689de707c?hl=en
==============================================================================

== 1 of 4 ==
Date: Fri, Jan 22 2010 10:09 pm
From: Steve Howell


On Jan 22, 5:12 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Steve Howell wrote:
> > I just saw the thread for medians, and it reminded me of a problem
> > that I need to solve.  We are writing some Python software for
> > sailing, and we need to detect when we've departed from the median
> > heading on the leg.  Calculating arithmetic medians is
> > straightforward, but compass bearings add a twist.
>
> > The numerical median of 1, 2, 3, 4, 5, 6, 359 is 4.  But for
> > navigational purposes you would actually order the numbers 359, 1, 2,
> > 3, 4, 5, 6, so the desired median heading of the boat is actually 3.
>
> > Of course, you could express 359 better as -1 degrees to north, then
> > the sequence would be -1, 1, 2, 3, 4, 5, and 6.  And you'd be good.
>
> > But that trick does not generalize if you go south instead, as you
> > have similar issues with -179, 174, 175, 176, 177, 178, and 179.
>
> > Has anybody solved this in Python, either for compass bearings or a
> > different domain?  I can think of kind of a brute force solution where
> > you keep rotating the sequence until the endpoints are closest
> > together mod 360, but I wonder if there is something more elegant.
>
> When you read the headings clockwise the values normally increase and
> you pick the middle one. However, when you cross north the values
> decrease. You can spot whether the set of headings crosses north by
> checking whether the difference between the minimum and maximum is
> greater than 180. If it is then make the westerly headings negative,
> sort the values, and pick the middle one, adding 180 if it's negative.
>
> def compass_median(points):
>      points = sorted(points)
>      if points[-1] - points[0] > 180:
>          points = [(p - 360 if p > 180 else p) for p in points]
>          points.sort()
>      median = points[len(points) // 2]
>      return median + 360 if median < 0 else median

I like this implementation, and it would probably work 99.9999% of the
time for my particular use case. The only (very contrived) edge case
that I can think of is when you have 10 bearings to SSW, 10 bearings
to SSE, and the two outliers are unfortunately in the NE and NW
quadrants. It seems like the algorithm above would pick one of the
outliers.

Maybe the refinement to the algorithm above would be to find the mean
first, by summing sines and cosines of the bearings, taking the
quotient, and applying the arctangent. Then use the resulting angle
as the equivalent of "due north" and adjust angles to be within (-180,
180) respect to the mean, pretty much as you do in the code above,
with minor modifications.

I realize the problem as I stated it as sort of ill-defined.

The way you stated the solution made me realize more deeply that you
basically have a list that needs to be rotated either clockwise or
counterclockwise in certain situations. Instead of using the 180-
degree check to coarsely (but usually effectively) rotate your frame
of reference, you could instead use the mean to eliminate even more
edge cases.


== 2 of 4 ==
Date: Fri, Jan 22 2010 10:29 pm
From: Nobody


On Fri, 22 Jan 2010 16:09:03 -0800, Steve Howell wrote:

> I just saw the thread for medians, and it reminded me of a problem
> that I need to solve. We are writing some Python software for
> sailing, and we need to detect when we've departed from the median
> heading on the leg. Calculating arithmetic medians is
> straightforward, but compass bearings add a twist.
>
> The numerical median of 1, 2, 3, 4, 5, 6, 359 is 4. But for
> navigational purposes you would actually order the numbers 359, 1, 2,
> 3, 4, 5, 6, so the desired median heading of the boat is actually 3.
>
> Of course, you could express 359 better as -1 degrees to north, then
> the sequence would be -1, 1, 2, 3, 4, 5, and 6. And you'd be good.
>
> But that trick does not generalize if you go south instead, as you
> have similar issues with -179, 174, 175, 176, 177, 178, and 179.
>
> Has anybody solved this in Python, either for compass bearings or a
> different domain? I can think of kind of a brute force solution where
> you keep rotating the sequence until the endpoints are closest
> together mod 360, but I wonder if there is something more elegant.

First, there isn't always a solution; what would you consider to be the
median of [0, 90, 180, 270]?

In the case where the bearings are clustered, one approach is to
convert each bearing from polar to cartesian coordinates, compute the
centroid, then convert back to polar coordinates, i.e.:

from math import degrees, radians, sin, cos, atan2

def mean(bearings):
x = sum(sin(radians(a)) for a in bearings)
y = sum(cos(radians(a)) for a in bearings)
return degrees(atan2(x, y))

Then, subtract the mean from each bearing, coerce all angles into the
range -180..+180, calculate the median, add the mean, coerce back to
0..360.

def median(bearings):
m = mean(bearings)
bearings = [(a - m + 180) % 360 - 180 for a in bearings]
bearings.sort()
median = bearings[len(bearings) / 2]
median += m
median %= 360
return median

== 3 of 4 ==
Date: Fri, Jan 22 2010 11:03 pm
From: Steve Howell


On Jan 22, 10:29 pm, Nobody <nob...@nowhere.com> wrote:
> On Fri, 22 Jan 2010 16:09:03 -0800, Steve Howell wrote:
> > I just saw the thread for medians, and it reminded me of a problem
> > that I need to solve.  We are writing some Python software for
> > sailing, and we need to detect when we've departed from the median
> > heading on the leg.  Calculating arithmetic medians is
> > straightforward, but compass bearings add a twist.
>
> > The numerical median of 1, 2, 3, 4, 5, 6, 359 is 4.  But for
> > navigational purposes you would actually order the numbers 359, 1, 2,
> > 3, 4, 5, 6, so the desired median heading of the boat is actually 3.
>
> > Of course, you could express 359 better as -1 degrees to north, then
> > the sequence would be -1, 1, 2, 3, 4, 5, and 6.  And you'd be good.
>
> > But that trick does not generalize if you go south instead, as you
> > have similar issues with -179, 174, 175, 176, 177, 178, and 179.
>
> > Has anybody solved this in Python, either for compass bearings or a
> > different domain?  I can think of kind of a brute force solution where
> > you keep rotating the sequence until the endpoints are closest
> > together mod 360, but I wonder if there is something more elegant.
>
> First, there isn't always a solution; what would you consider to be the
> median of [0, 90, 180, 270]?
>

You probably posted before seeing my recent post. I agree that the
problem is ill-defined for certain cases.

> In the case where the bearings are clustered, one approach is to
> convert each bearing from polar to cartesian coordinates, compute the
> centroid, then convert back to polar coordinates, i.e.:
>
>         from math import degrees, radians, sin, cos, atan2
>
>         def mean(bearings):
>                 x = sum(sin(radians(a)) for a in bearings)
>                 y = sum(cos(radians(a)) for a in bearings)
>                 return degrees(atan2(x, y))
>
> Then, subtract the mean from each bearing, coerce all angles into the
> range -180..+180, calculate the median, add the mean, coerce back to
> 0..360.
>
>         def median(bearings):
>                 m = mean(bearings)
>                 bearings = [(a - m + 180) % 360 - 180 for a in bearings]
>                 bearings.sort()
>                 median = bearings[len(bearings) / 2]
>                 median += m
>                 median %= 360
>                 return median

Yep, that's exactly the solution I'm leaning toward now.


== 4 of 4 ==
Date: Sat, Jan 23 2010 12:20 am
From: Terry Reedy


On 1/23/2010 1:09 AM, Steve Howell wrote:
> On Jan 22, 5:12 pm, MRAB<pyt...@mrabarnett.plus.com> wrote:

> I like this implementation, and it would probably work 99.9999% of the
> time for my particular use case. The only (very contrived) edge case
> that I can think of is when you have 10 bearings to SSW, 10 bearings
> to SSE, and the two outliers are unfortunately in the NE and NW
> quadrants. It seems like the algorithm above would pick one of the
> outliers.
>
> Maybe the refinement to the algorithm above would be to find the mean
> first, by summing sines and cosines of the bearings, taking the
> quotient, and applying the arctangent. Then use the resulting angle
> as the equivalent of "due north" and adjust angles to be within (-180,
> 180) respect to the mean, pretty much as you do in the code above,
> with minor modifications.

I was going to suggest this. Let us know if it seems to work.

> I realize the problem as I stated it as sort of ill-defined.

Yes, I think this one reason stat books typically ignore directional
data. I think it is an unfortunate omission.

Terry Jan Reedy


==============================================================================
TOPIC: Problems with the date of modification of files on the flash drive in
windows
http://groups.google.com/group/comp.lang.python/t/dd1a415db7e08c1c?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 22 2010 11:19 pm
From: Nobody


On Fri, 22 Jan 2010 21:23:37 -0800, Aleksey wrote:

> I write crossplatform program and have next problem under windows
> (under linux is all OK) :
>
> I'm trying to get the UTC modification date of files on the flash
> drive under windows. In the flash card system is FAT.

Timestamps stored on a FAT filesystem use an unspecified timezone. When a
Windows PC writes a timestamp to a FAT filesystem, it stores the current
local time. For devices such as cameras, the timestamp will be whatever's
in the device's clock if it has one (even if it does, there's no
guarantee that it's accurate), or some arbitrary value otherwise.

When reading a timestamp, it's just used as-is, i.e. it's a local time
with no timezone specified. There is no way to convert this value to UTC
or to a specific timezone.

If you're using FAT, timestamps are pretty much meaningless unless the
device never travels between timezones (in which case, they'll probably
be roughly correct or wrong by roughly an hour either way due to DST).

When it comes to timezones, anything developed prior to NT
basically pretends that they don't exist.


==============================================================================
TOPIC: Using dictionary key as a regular expression class
http://groups.google.com/group/comp.lang.python/t/6ee5d2027516c648?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 22 2010 11:45 pm
From: Terry Reedy


On 1/22/2010 9:58 PM, Chris Jones wrote:
> On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:

> Do you mean I should just read the file one character at a time?

Whoops, my misdirection (you can .read(1), but this is s l o w.
I meant to suggest processing it a char at a time.

1. If not too big,

for c in open(x, 'rb').read() # left .read() off
# 'b' will get bytes, though ord(c) same for ascii chars for byte or
unicode

2. If too big for that,

for line in open():
for c in line: # or left off this part


>> To only count ascii chars, as should be the case for C code,
>>
>> achars = [0]*63
>> for c in open('xxx', 'c'):
>> try:
>> achars[ord(c)-32] += 1
>> except IndexError:
>> pass
>>
>> for i,n in enumerate(achars)
>> print chr(i), n
>>
>> or sum subsets as desired.
>
> Thanks much for the snippet, let me play with it and see if I can come
> up with a Unicode/utf-8 version.. since while I'm at it I might as well
> write something a bit more general than C code.
>
> Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
> a problem.

For any extended ascii, use larger array without decoding (until print,
if need be). For unicode, add encoding to open and 'c in line' will
return unicode chars. Then use *one* dict or defaultdict. I think
something like

from collections import defaultdict
d = defaultdict(int)
...
d[c] += 1 # if c is new, d[c] defaults to int() == 0

Terry Jan Reedy

==============================================================================

You received this message because you are subscribed to the Google Groups "comp.lang.python"
group.

To post to this group, visit http://groups.google.com/group/comp.lang.python?hl=en

To unsubscribe from this group, send email to comp.lang.python+unsubscribe@googlegroups.com

To change the way you get mail from this group, visit:
http://groups.google.com/group/comp.lang.python/subscribe?hl=en

To report abuse, send email explaining the problem to abuse@googlegroups.com

==============================================================================
Google Groups: http://groups.google.com/?hl=en

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home


Real Estate