comp.lang.python - 26 new messages in 9 topics

comp.lang.python
http://groups.google.com/group/comp.lang.python?hl=en

comp.lang.python@googlegroups.com

Today's topics:

* Python example source code - 4 messages, 3 authors
http://groups.google.com/group/comp.lang.python/t/8594472fa123f77e?hl=en
* Problem writing some strings (UnicodeEncodeError) - 6 messages, 3 authors
http://groups.google.com/group/comp.lang.python/t/08d2e6a4bc11d1c3?hl=en
* Data peeping function? - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/c57f6b9cfd4b7ca0?hl=en
* 'Straße' ('Strasse') and Python 2 - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/93ddbbff468ab95d?hl=en
* python first project - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/9e82659cb027cf95?hl=en
* Open Question - I'm a complete novice in programming so please bear with me...Is python equivalent to C, C++ and java combined? - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/eaf393e9028e2f09?hl=en
* efficient way to process data - 10 messages, 4 authors
http://groups.google.com/group/comp.lang.python/t/f7c2c58424bf2b3e?hl=en
* python query on firebug extention - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/ed10a892ce2c4afa?hl=en
* Python: 404 Error when trying to login a webpage by using 'urllib' and 'HTTPCookieProcessor' - 1 messages, 1 author
http://groups.google.com/group/comp.lang.python/t/bf0e28e020a02c69?hl=en

==============================================================================
TOPIC: Python example source code
http://groups.google.com/group/comp.lang.python/t/8594472fa123f77e?hl=en
==============================================================================

== 1 of 4 ==
Date: Sun, Jan 12 2014 8:59 am
From: ngangsia akumbo

On Sunday, January 12, 2014 5:38:03 PM UTC+1, Joel Goldstick wrote:
> On Sun, Jan 12, 2014 at 10:13 AM, ngangsia akumbo <ngangsia@gmail.com> wrote:
>

> Don't forget Python Module of the Week pymotw.com/

Thanks

== 2 of 4 ==
Date: Sun, Jan 12 2014 9:06 am
From: ngangsia akumbo

On Sunday, January 12, 2014 5:52:19 PM UTC+1, Emile van Sebille wrote:
> On 01/12/2014 06:37 AM, ngangsia akumbo wrote:
>

> I'd recommend http://effbot.org/librarybook/ even though it's v2
> specific and somewhat dated.

Thank you very much, it is very nice.

== 3 of 4 ==
Date: Sun, Jan 12 2014 9:47 am
From: memilanuk

On 01/12/2014 06:37 AM, ngangsia akumbo wrote:
> where can i find example source code by topic?
> Any help please
>

nullege.com is usually helpful...

== 4 of 4 ==
Date: Sun, Jan 12 2014 3:16 pm
From: Denis McMahon

On Sun, 12 Jan 2014 06:37:18 -0800, ngangsia akumbo wrote:

> where can i find example source code by topic?
> Any help please

You don't want to be looking at source code yet, you want to be talking
to the users of the system you're trying to design to find out what their
requirements are.

--
Denis McMahon, denismfmcmahon@gmail.com

==============================================================================
TOPIC: Problem writing some strings (UnicodeEncodeError)
http://groups.google.com/group/comp.lang.python/t/08d2e6a4bc11d1c3?hl=en
==============================================================================

== 1 of 6 ==
Date: Sun, Jan 12 2014 8:55 am
From: Emile van Sebille

On 01/12/2014 07:36 AM, Paulo da Silva wrote:
> Hi!
>
> I am using a python3 script to produce a bash script from lots of
> filenames got using os.walk.
>
> I have a template string for each bash command in which I replace a
> special string with the filename and then write the command to the bash
> script file.
>
> Something like this:
>
> shf=open(bashfilename,'w')
> filenames=getfilenames() # uses os.walk
> for fn in filenames:
>     ...
>     cmd=templ.replace("<fn>",fn)
>     shf.write(cmd)
>
> For certain filenames I got a UnicodeEncodeError exception at
> shf.write(cmd)!
> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>
> How can I fix this?

Not sure exactly, but I'd try

shf=open(bashfilename,'wb')

as a start.

HTH,

Emile

== 2 of 6 ==
Date: Sun, Jan 12 2014 9:51 am
From: Paulo da Silva

Em 12-01-2014 16:23, Peter Otten escreveu:
> Paulo da Silva wrote:
>
>> I am using a python3 script to produce a bash script from lots of
>> filenames got using os.walk.
>>
>> I have a template string for each bash command in which I replace a
>> special string with the filename and then write the command to the bash
>> script file.
>>
>> Something like this:
>>
>> shf=open(bashfilename,'w')
>> filenames=getfilenames() # uses os.walk
>> for fn in filenames:
>>     ...
>>     cmd=templ.replace("<fn>",fn)
>>     shf.write(cmd)
>>
>> For certain filenames I got a UnicodeEncodeError exception at
>> shf.write(cmd)!
>> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>>
>> How can I fix this?
>>
>> Thanks for any help/comments.
>
> You make it harder to debug your problem by not giving the complete
> traceback. If the error message contains 'surrogates not allowed' like in
> the demo below
>
>>>> with open("tmp.txt", "w") as f:
> ...     f.write("\udcef")
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in
> position 0: surrogates not allowed

That is the situation. I just lost it and it would take a few hours to
repeat the situation. Sorry.

>
> you have filenames that are not valid UTF-8 on your harddisk.
>
> A possible fix would be to use bytes instead of str. For that you need to
> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk()
> call.
This is my 1st time with python3, so I am confused!

As far as I can understand, it seems that os.walk is returning the
filenames exactly as they are on disk. Just bytes like in C.

My template is a string. What is the result of the replace command? Is
there any change in the filename from os.walk contents?

Now, if the result of the replace has the replaced filename unchanged
how do I "convert" it to bytes type, without changing its contents, so
that I can write to the bashfile opened with "wb"?

>
> Or you just go and fix the offending names.
This is impossible in my case.
I need a bash script with the names as they are on disk.

== 3 of 6 ==
Date: Sun, Jan 12 2014 10:50 am
From: Peter Otten <__peter__@web.de>

Paulo da Silva wrote:

> Em 12-01-2014 16:23, Peter Otten escreveu:
>> Paulo da Silva wrote:
>>
>>> I am using a python3 script to produce a bash script from lots of
>>> filenames got using os.walk.
>>>
>>> I have a template string for each bash command in which I replace a
>>> special string with the filename and then write the command to the bash
>>> script file.
>>>
>>> Something like this:
>>>
>>> shf=open(bashfilename,'w')
>>> filenames=getfilenames() # uses os.walk
>>> for fn in filenames:
>>>     ...
>>>     cmd=templ.replace("<fn>",fn)
>>>     shf.write(cmd)
>>>
>>> For certain filenames I got a UnicodeEncodeError exception at
>>> shf.write(cmd)!
>>> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>>>
>>> How can I fix this?
>>>
>>> Thanks for any help/comments.
>>
>> You make it harder to debug your problem by not giving the complete
>> traceback. If the error message contains 'surrogates not allowed' like in
>> the demo below
>>
>>>>> with open("tmp.txt", "w") as f:
>> ...     f.write("\udcef")
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in
>> position 0: surrogates not allowed
>
> That is the situation. I just lost it and it would take a few houres to
> repeat the situation. Sorry.
>
>
>>
>> you have filenames that are not valid UTF-8 on your harddisk.
>>
>> A possible fix would be to use bytes instead of str. For that you need to
>> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk()
>> call.
> This is my 1st time with python3, so I am confused!
>
> As much I could understand it seems that os.walk is returning the
> filenames exactly as they are on disk. Just bytes like in C.

No, they are decoded with the preferred encoding. With UTF-8 that can fail,
and if it does the surrogateescape error handler replaces the offending
bytes with special codepoints:

>>> import os
>>> with open(b"\xe4\xf6\xfc", "w") as f: f.write("whatever")
...
8
>>> os.listdir()
['\udce4\udcf6\udcfc']

You can bypass the decoding process by providing a bytes argument to
os.listdir() (or os.walk() which uses os.listdir() internally):

>>> os.listdir(b".")
[b'\xe4\xf6\xfc']

To write these raw bytes into a file the file has of course to be binary,
too.
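
On the question of converting such a name back to bytes without changing it:
the following is only a minimal sketch, assuming a UTF-8 locale. The explicit
codec calls stand in for what os.fsdecode() and os.fsencode() are expected to
do on a POSIX system; the "ls <fn>" template is made up for illustration.

```python
# Bytes as they might sit on disk: Latin-1 encoded, so not valid UTF-8.
raw = b"\xe4\xf6\xfc"

# Decoding with surrogateescape maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising an error.
as_str = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes unchanged.
back = as_str.encode("utf-8", "surrogateescape")

# str.replace() leaves the surrogates alone, so a str template can be
# filled in first and the whole command converted to bytes afterwards.
cmd = "ls <fn>\n".replace("<fn>", as_str)
cmd_bytes = cmd.encode("utf-8", "surrogateescape")

print(ascii(as_str), back == raw, cmd_bytes)
```

The resulting cmd_bytes can be written to a file opened with "wb".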

> My template is a string. What is the result of the replace command? Is
> there any change in the filename from os.walk contents?
>
> Now, if the result of the replace has the replaced filename unchanged
> how do I "convert" it to bytes type, without changing its contents, so
> that I can write to the bashfile opened with "wb"?
>
>
>>
>> Or you just go and fix the offending names.
> This is impossible in my case.
> I need a bash script with the names as they are on disk.

I think instead of the hard way sketched out above it will be sufficient to
specify the error handler when opening the destination file

shf = open(bashfilename, 'w', errors="surrogateescape")

but I have not tried it myself. Also, some bytes may need to be escaped,
either to be understood by the shell, or to address security concerns:

>>> import os
>>> template = "ls <fn>"
>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename))
...
ls foo; rm bar
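
Putting the two points together, a rough sketch (not tested against the real
data) could look like the code below. The function name write_script, the
sample names and the temporary path are assumptions for illustration;
shlex.quote() is used here as one way to do the escaping, which is a swap-in
and not something the original script does.

```python
import os
import shlex
import tempfile

def write_script(path, template, filenames):
    """Write one shell command per filename, keeping undecodable bytes."""
    # errors="surrogateescape" turns the lone surrogates produced by
    # os.walk()/os.listdir() back into the original bytes on write.
    # newline="\n" keeps Unix line endings on every platform.
    with open(path, "w", encoding="utf-8", errors="surrogateescape",
              newline="\n") as shf:
        for fn in filenames:
            # shlex.quote() wraps the name in single quotes when it
            # contains anything the shell could interpret.
            shf.write(template.replace("<fn>", shlex.quote(fn)) + "\n")

# Hypothetical input: one name that is not valid UTF-8, one hostile name.
names = [b"\xe4\xf6\xfc".decode("utf-8", "surrogateescape"), "foo; rm bar"]
script_path = os.path.join(tempfile.mkdtemp(), "demo.sh")
write_script(script_path, "ls <fn>", names)

with open(script_path, "rb") as f:
    script_bytes = f.read()
print(script_bytes)
```

With quoting applied, "foo; rm bar" reaches ls as a single argument instead
of running a second command.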

== 4 of 6 ==
Date: Sun, Jan 12 2014 11:41 am
From: Paulo da Silva

>
> I think instead of the hard way sketched out above it will be sufficient to
> specify the error handler when opening the destination file
>
> shf = open(bashfilename, 'w', errors="surrogateescape")
This seems to fix everything!
I tried with a small test set and it worked.

>
> but I have not tried it myself. Also, some bytes may need to be escaped,
> either to be understood by the shell, or to address security concerns:
>

Since I am putting the file names between "", the only char that needs to
be escaped is the " itself.

I'm gonna try with the real thing.

Thank you very much for the fixing and for everything I have learned here.

== 5 of 6 ==
Date: Sun, Jan 12 2014 12:29 pm
From: Peter Otten <__peter__@web.de>

Paulo da Silva wrote:

>> but I have not tried it myself. Also, some bytes may need to be escaped,
>> either to be understood by the shell, or to address security concerns:
>>
>
> Since I am puting the file names between "", the only char that needs to
> be escaped is the " itself.

What about the escape char?

== 6 of 6 ==
Date: Sun, Jan 12 2014 3:53 pm
From: Paulo da Silva

Em 12-01-2014 20:29, Peter Otten escreveu:
> Paulo da Silva wrote:
>
>>> but I have not tried it myself. Also, some bytes may need to be escaped,
>>> either to be understood by the shell, or to address security concerns:
>>>
>>
>> Since I am puting the file names between "", the only char that needs to
>> be escaped is the " itself.
>
> What about the escape char?
>
Just this: fn=fn.replace('"','\\"')

So far I didn't find any problem, but the script is still running.
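
For names placed between double quotes, a sketch of a fuller escape is given
below. It assumes bash's rule that, inside "...", a backslash is special only
before $, `, " and \, so those four characters are the ones handled. The
helper name dq_escape is hypothetical.

```python
def dq_escape(name):
    """Escape a string for use inside bash double quotes."""
    # The backslash goes first; otherwise the backslashes added for the
    # other characters would themselves be doubled.
    for ch in ("\\", '"', "$", "`"):
        name = name.replace(ch, "\\" + ch)
    return name

# A made-up awkward filename containing all four special characters.
print('ls "%s"' % dq_escape('a"b$HOME\\c`id`'))
```

Escaping only the " would still let $HOME or `id` be expanded by the shell,
and a name ending in a backslash would swallow the closing quote.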

==============================================================================
TOPIC: Data peeping function?
http://groups.google.com/group/comp.lang.python/t/c57f6b9cfd4b7ca0?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 9:36 am
From: Thor Whalen

The first thing I do once I import new data (as a pandas dataframe) is to .head() it, .describe() it, and then kick around a few specific stats according to what I see.

But I'm not satisfied with .describe(). Among other things, non-numerical columns are ignored, and off-the-shelf stats will be computed for any numerical column.

I've been shopping around for a "data peeping" function that would:

(1) Have a hands-off mode where simply typing
diagnose_this(data)
the function would figure things out on its own, and notify me when in doubt. For example, it would assume that any string data with not too many unique values should be considered categorical and compute the appropriate statistics.

(2) Perform standard diagnoses and print them out. For example, (a) missing values? (b) heterogeneously formatted data? (c) columns with only one unique value? etc.

(3) Be parametrizable, if I so choose.

Does anyone know of such a function?
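
To make the wish list concrete, here is a rough sketch of checks (1) and (2).
It is written against a plain dict of column lists so that it runs without
pandas; the max_categories threshold and the report layout are assumptions.
With a DataFrame the same checks would presumably map onto isnull(),
nunique() and dtypes.

```python
def diagnose_this(data, max_categories=10):
    """Return a per-column report for a dict of {column name: values}."""
    report = {}
    for name, values in data.items():
        present = [v for v in values if v is not None]
        kinds = {type(v).__name__ for v in present}
        unique = set(present)
        info = {
            "missing": len(values) - len(present),   # (2a) missing values
            "types": sorted(kinds),
            "mixed_types": len(kinds) > 1,           # (2b) heterogeneous data
            "constant": len(unique) == 1,            # (2c) one unique value
        }
        # Guess: strings with few distinct values are treated as categorical.
        if kinds == {"str"} and len(unique) <= max_categories:
            info["categorical"] = True
            info["counts"] = {u: present.count(u) for u in sorted(unique)}
        report[name] = info
    return report

# Made-up sample data.
sample = {
    "city": ["Oslo", "Rome", None, "Oslo"],
    "n": [1, 1, 1, 1],
    "mix": [1, "2", 3.0, None],
}
for column, info in diagnose_this(sample).items():
    print(column, info)
```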

==============================================================================
TOPIC: 'Straße' ('Strasse') and Python 2
http://groups.google.com/group/comp.lang.python/t/93ddbbff468ab95d?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 10:33 am
From: MRAB

On 2014-01-12 08:31, Peter Otten wrote:
> wxjmfauth@gmail.com wrote:
>
>>>>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>>> s = 'Straße'
>>>>> assert len(s) == 6
>>>>> assert s[5] == 'e'
>>>>>
>>
>> jmf
>
> Signifying nothing. (Macbeth)
>
> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
> [GCC 4.6.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> s = "Straße"
>>>> assert len(s) == 6
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError
>>>> assert s[5] == "e"
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError
>
>
The point is that in Python 2 'Straße' is a bytestring and its length
depends on the encoding of the source file. If the source file is UTF-8
then 'Straße' is a string literal with 7 bytes between the single
quotes.
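
The same effect can be reproduced in Python 3 by encoding the text
explicitly. This is only an illustration of the byte counts; Latin-1 is
assumed here as a stand-in for a one-byte-per-character source encoding such
as the one in the Windows session quoted above.

```python
s = "Stra\u00dfe"               # 'Straße', written with an escape

utf8 = s.encode("utf-8")        # ß becomes the two bytes C3 9F
latin1 = s.encode("latin-1")    # ß becomes the single byte DF

print(len(s), len(utf8), len(latin1))
print(utf8[5:6], latin1[5:6])
```

With UTF-8 the byte at index 5 is the second half of ß, which is why the
assertions fail on the Linux session and pass on the Windows one.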

==============================================================================
TOPIC: python first project
http://groups.google.com/group/comp.lang.python/t/9e82659cb027cf95?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 10:50 am
From: MRAB

On 2014-01-12 06:04, Chris Angelico wrote:
> On Sun, Jan 12, 2014 at 4:14 PM, ngangsia akumbo <ngangsia@gmail.com>
> wrote:
>> What options do you think i can give the Ceo. Because from what you
>> have outline, i think i will like to follow your advice.
>>
>> If it is just some recording data stuff then some spreadsheet can
>> do the work.
>>
>> From all indication it is a very huge project.
>>
>> How much do you thing all this will cost if we were to put the
>> system all complete.
>
> If you currently do all your bills and things on paper, then this
> job is going to be extremely daunting. Even if you don't write a
> single line of code (ie you buy a ready-made system), you're going to
> have to convert everybody to doing things the new way. In that case,
> I would recommend getting some people together to discuss exactly
> what you need to do, and then purchase an accounting, warehousing, or
> inventory management system, based on what you actually need it to
> do.
>
> On the other hand, if it's already being done electronically, your
> job is IMMENSELY easier. Easier, but more complex to describe,
> because what you're really asking for is a program that will get
> certain data out of your accounting/inventory management system and
> display it. The difficulty of that job depends entirely on what
> you're using for that data entry.
>
You should also consider whether you need to do it all at once or could
do it incrementally. Look at what functionality you might want and where
you might get the greatest benefit and start there. Doing it that way
will reduce the chances of you committing a lot of resources (time and
money) building a system, only to find at the end that you either left
something out or added something that you didn't really need after all.

==============================================================================
TOPIC: Open Question - I'm a complete novice in programming so please bear
with me...Is python equivalent to C, C++ and java combined?
http://groups.google.com/group/comp.lang.python/t/eaf393e9028e2f09?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 10:53 am
From: Grant Edwards

On 2014-01-11, pintreo mardi <bigearl497@outlook.com> wrote:

> Hi, I've just begun to learn programming, I have an open question for
> the group: Is the Python language an all in one computer language
> which could replace C, C++, Java etc..

No. Python can not replace C in a number of application areas:

* Bare-metal applications without an OS.

* Low-resource applications with limited memory (like a few KB).

* Device drivers and kernel modules for OSes like Linux, Unix, (and,
AFAIK, Windows).

* Computationally intensive applications where there isn't a library
available written in C or FORTRAN to do the heavy lifting.

For general application programming on a server or PC, then Python can
replace many/most uses of C/C++/Java.

--
Grant Edwards               grant.b.edwards        Yow! Look into my eyes and
                                  at               try to forget that you have
                              gmail.com            a Macy's charge card!

==============================================================================
TOPIC: efficient way to process data
http://groups.google.com/group/comp.lang.python/t/f7c2c58424bf2b3e?hl=en
==============================================================================

== 1 of 10 ==
Date: Sun, Jan 12 2014 11:23 am
From: Larry Martell

I have a Python app that queries a MySQL DB. The query has this form:

SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
FROM t
GROUP BY a, b, c, d, f

x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
10053.490, 2542.094).

The business issue is that if either x or y in 2 rows that are in the
same a, b, c, d group are within 1 of each other then they should be
grouped together. And to make it more complicated, the tolerance is
applied as a rolling continuum. For example, if the x and y in a set
of grouped rows are:

row 1: 1.5, 9.5
row 2: 2.4, 20.8
row 3: 3.3, 40.6
row 4: 4.2, 2.5
row 5: 5.1, 10.1
row 6: 6.0, 7.9
row 7: 8.0, 21.0
row 8: 100, 200

1 through 6 get combined because all their X values are within the
tolerance of some other X in the set that's been combined. 7's Y value
is within the tolerance of 2's Y, so that should be combined as well.
8 is not combined because neither the X nor the Y value is within the
tolerance of any X or Y in the set that was combined.

AFAIK, there is no way to do this in SQL. In python I can easily parse
the data and identify the rows that need to be combined, but then I've
lost the ability to calculate the average and std across the combined
data set. The only way I can think of to do this is to remove the
grouping from the SQL and do all the grouping and aggregating myself.
But this query often returns 20k to 30k rows after grouping. It could
easily be 80k to 100k rows or more that I have to process if I remove
the grouping and I think that will end up being very slow.

Anyone have any ideas how I can efficiently do this?

Thanks!
-larry

== 2 of 10 ==
Date: Sun, Jan 12 2014 11:53 am
From: Petite Abeille

On Jan 12, 2014, at 8:23 PM, Larry Martell <larry.martell@gmail.com> wrote:

> AFAIK, there is no way to do this in SQL.

Sounds like a job for window functions (aka analytic functions) [1][2].

[1] http://www.postgresql.org/docs/9.3/static/tutorial-window.html
[2] http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions004.htm#SQLRF06174

== 3 of 10 ==
Date: Sun, Jan 12 2014 2:18 pm
From: Chris Angelico

On Mon, Jan 13, 2014 at 6:53 AM, Petite Abeille
<petite.abeille@gmail.com> wrote:
> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry.martell@gmail.com> wrote:
>
>> AFAIK, there is no way to do this in SQL.
>
> Sounds like a job for window functions (aka analytic functions) [1][2].

That's my thought too. I don't think MySQL has them, though, so it's
either going to have to be done in Python, or the database back-end
will need to change. Hard to say which would be harder.

ChrisA

== 4 of 10 ==
Date: Sun, Jan 12 2014 2:43 pm
From: Dennis Lee Bieber

On Sun, 12 Jan 2014 14:23:17 -0500, Larry Martell <larry.martell@gmail.com>
declaimed the following:

>I have an python app that queries a MySQL DB. The query has this form:
>
>SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>FROM t
>GROUP BY a, b, c, d, f
>
>x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>10053.490, 2542.094).
>

Decimal (Numeric) or floating/real. If the latter, the internal storage
may not be exact (378.1811111111 and 378.179999999 may both "display" as
378.18, but will not match for grouping).

>The business issue is that if either x or y in 2 rows that are in the
>same a, b, c, d group are within 1 of each other then they should be
>grouped together. And to make it more complicated, the tolerance is
>applied as a rolling continuum. For example, if the x and y in a set
>of grouped rows are:
>
As I understand group by, it will first group by "a", WITHIN the "a"
groups it will then group by "b"... Probably not a matter germane to the
problem as you are concerning yourself with the STRING representation of
"x" and "y" with a comma delimiter -- which is only looked at if the
"a,b,c,d" are equal... Thing is, a string comparison is going to operate
strictly left to right -- it won't even see your "y" value unless all the
"x" value is equal.

You may need to operate using subselects... So that you can specify
something like

where abs(s1.x -s2.x) < tolerance or abs(s1.y-s2.y) < tolerance
and (s1.a = s2.a ... s1.d = s2.d)

s1/s2 are the subselects (you may need a primary key <> primary key to
avoid having it output a record where the two subselects are for the SAME
record -- or maybe not, since you /do/ want that record also output). Going
to be a costly query since you are basically doing

foreach r1 in s1
    foreach r2 in s2
        emit r2 when...

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/

== 5 of 10 ==
Date: Sun, Jan 12 2014 3:27 pm
From: Chris Angelico

On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell@gmail.com> wrote:
> I have an python app that queries a MySQL DB. The query has this form:
>
> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
> FROM t
> GROUP BY a, b, c, d, f
>
> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
> 10053.490, 2542.094).
>
> The business issue is that if either x or y in 2 rows that are in the
> same a, b, c, d group are within 1 of each other then they should be
> grouped together. And to make it more complicated, the tolerance is
> applied as a rolling continuum. For example, if the x and y in a set
> of grouped rows are:
>
> row 1: 1.5, 9.5
> row 2: 2.4, 20.8
> row 3: 3.3, 40.6
> row 4: 4.2, 2.5
> row 5: 5.1, 10.1
> row 6: 6.0, 7.9
> row 7: 8.0, 21.0
> row 8: 100, 200
>
> 1 through 6 get combined because all their X values are within the
> tolerance of some other X in the set that's been combined. 7's Y value
> is within the tolerance of 2's Y, so that should be combined as well.
> 8 is not combined because neither the X or Y value is within the
> tolerance of any X or Y in the set that was combined.

Trying to get my head around this a bit more. Are columns a/b/c/d
treated as a big category (eg type, brand, category, model), such that
nothing will ever be grouped that has any difference in those four
columns? If so, we can effectively ignore them and pretend we have a
table with exactly one set (eg stick a WHERE clause onto the query
that stipulates their values). Then what you have is this:

* Aggregate based on proximity of x and y
* Emit results derived from e

Is that correct?

So here's my way of writing it.

* Subselect: List all values for x, in order, and figure out which
ones are less than the previous value plus one
* Subselect: Ditto, for y.
* Outer select: Somehow do an either-or group. I'm not quite sure how
to do that part, actually!

A PGSQL window function would cover the two subselects - at least, I'm
fairly sure it would. I can't quite get the whole thing, though; I can
get a true/false flag that says whether it's near to the previous one
(that's easy), and creating a grouping column value should be possible
from that but I'm not sure how.

But an either-or grouping is a bit trickier. The best I can think of
is to collect all the y values for each group of x values, and then if
any two groups 'overlap' (ie have points within 1.0 of each other),
merge the groups. That's going to be seriously tricky to do in SQL, I
think, so you may have to go back to Python on that one.

My analysis suggests that, whatever happens, you're going to need
every single y value somewhere. So it's probably not worth trying to
do any grouping/aggregation in SQL, since you need to further analyze
all the individual data points. I can't think of any way better than
just leafing through the whole table (either in Python or in a stored
procedure - if you can run your script on the same computer that's
running the database, I'd do that, otherwise consider a stored
procedure to reduce network transfers) and building up mappings.

Of course, "I can't think of a way" does not equate to "There is no
way". There may be some magic trick that I didn't think of, or some
arcane incantation that gets what you want. Who knows? If you can
produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!

ChrisA

[1] http://wiki.postgresql.org/wiki/Mandelbrot_set
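
For the single-axis part, one possible way to turn the "near the previous
one" flag into a grouping column is a running sum of the "starts a new
group" flags. The sketch below shows that in Python, using the x and y
values from the original post; the function name group_ids is made up. In a
database with window functions the same idea would presumably be a SUM()
over the ordered flag.

```python
from itertools import accumulate

def group_ids(sorted_values, tol=1.0):
    """Assign a group number to each value of an ascending list."""
    if not sorted_values:
        return []
    # 1 where a value is more than tol above its predecessor, else 0.
    starts = [0] + [int(b - a > tol)
                    for a, b in zip(sorted_values, sorted_values[1:])]
    # The running total of the flags is the group number.
    return list(accumulate(starts))

xs = [1.5, 2.4, 3.3, 4.2, 5.1, 6.0, 8.0, 100]
ys = sorted([9.5, 20.8, 40.6, 2.5, 10.1, 7.9, 21.0, 200])
print(group_ids(xs))
print(group_ids(ys))
```

This handles x and y separately only; the either-or merge of the two
groupings still has to be done on top of it.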

== 6 of 10 ==
Date: Sun, Jan 12 2014 7:17 pm
From: Larry Martell

On Sun, Jan 12, 2014 at 2:53 PM, Petite Abeille
<petite.abeille@gmail.com> wrote:
>
> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry.martell@gmail.com> wrote:
>
>> AFAIK, there is no way to do this in SQL.
>
> Sounds like a job for window functions (aka analytic functions) [1][2].
>
> [1] http://www.postgresql.org/docs/9.3/static/tutorial-window.html
> [2] http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions004.htm#SQLRF06174

Unfortunately, MySQL does not support this.

== 7 of 10 ==
Date: Sun, Jan 12 2014 7:18 pm
From: Larry Martell

On Sun, Jan 12, 2014 at 5:18 PM, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, Jan 13, 2014 at 6:53 AM, Petite Abeille
> <petite.abeille@gmail.com> wrote:
>> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry.martell@gmail.com> wrote:
>>
>>> AFAIK, there is no way to do this in SQL.
>>
>> Sounds like a job for window functions (aka analytic functions) [1][2].
>
> That's my thought too. I don't think MySQL has them, though, so it's
> either going to have to be done in Python, or the database back-end
> will need to change. Hard to say which would be harder.

Changing the database is not feasible.

== 8 of 10 ==
Date: Sun, Jan 12 2014 7:25 pm
From: Larry Martell

On Sun, Jan 12, 2014 at 5:43 PM, Dennis Lee Bieber
<wlfraed@ix.netcom.com> wrote:
> On Sun, 12 Jan 2014 14:23:17 -0500, Larry Martell <larry.martell@gmail.com>
> declaimed the following:
>
>>I have an python app that queries a MySQL DB. The query has this form:
>>
>>SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>>FROM t
>>GROUP BY a, b, c, d, f
>>
>>x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>>10053.490, 2542.094).
>>
>
> Decimal (Numeric) or floating/real. If the latter, the internal storage
> may not be exact (378.1811111111 and 378.179999999 may both "display" as
> 378.18, but will not match for grouping).

In the database they are decimal. They are being converted to char by
the CONCAT(x, ',', y).

>>The business issue is that if either x or y in 2 rows that are in the
>>same a, b, c, d group are within 1 of each other then they should be
>>grouped together. And to make it more complicated, the tolerance is
>>applied as a rolling continuum. For example, if the x and y in a set
>>of grouped rows are:
>>
> As I understand group by, it will first group by "a", WITHIN the "a"
> groups it will then group by "b"... Probably not a matter germane to the
> problem as you are concerning yourself with the STRING representation of
> "x" and "y" with a comma delimiter -- which is only looked at if the
> "a,b,c,d" are equal... Thing is, a string comparison is going to operate
> strictly left to right -- it won't even see your "y" value unless all the
> "x" value is equal.

Yes, that is correct. The original requirement was to group by (X, Y),
so the CONCAT(x, ',', y) was correct and working. Then the requirement
was changed to apply the tolerance as I described.

>
> You may need to operate using subselects... So that you can specify
> something like
>
> where abs(s1.x -s2.x) < tolerance or abs(s1.y-s2.y) < tolerance
> and (s1.a = s2.a ... s1.d = s2.d)
>
> s1/s1 are the subselects (you may need a primary key <> primary key to
> avoid having it output a record where the two subselects are for the SAME
> record -- or maybe not, since you /do/ want that record also output). Going
> to be a costly query since you are basically doing
>
> foreach r1 in s1
> foreach r2 in s2
> emit r2 when...

Speed is an issue here, and while the current query performs well, in
my experience subqueries and self joins do not. I'm going to try and
do it all in python and see how it performs. The other option is to
pre-process the data on the way into the database. Doing that will
eliminate some of the data partitioning as all of the data that could
be joined will be in the same input file. I'm just not sure if it will
be OK to actually munge the data. I'll find that out tomorrow.

== 9 of 10 ==
Date: Sun, Jan 12 2014 7:35 pm
From: Larry Martell

On Sun, Jan 12, 2014 at 6:27 PM, Chris Angelico <rosuav@gmail.com> wrote:
> On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.martell@gmail.com> wrote:
>> I have an python app that queries a MySQL DB. The query has this form:
>>
>> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>> FROM t
>> GROUP BY a, b, c, d, f
>>
>> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>> 10053.490, 2542.094).
>>
>> The business issue is that if either x or y in 2 rows that are in the
>> same a, b, c, d group are within 1 of each other then they should be
>> grouped together. And to make it more complicated, the tolerance is
>> applied as a rolling continuum. For example, if the x and y in a set
>> of grouped rows are:
>>
>> row 1: 1.5, 9.5
>> row 2: 2.4, 20.8
>> row 3: 3.3, 40.6
>> row 4: 4.2, 2.5
>> row 5: 5.1, 10.1
>> row 6: 6.0, 7.9
>> row 7: 8.0, 21.0
>> row 8: 100, 200
>>
>> 1 through 6 get combined because all their X values are within the
>> tolerance of some other X in the set that's been combined. 7's Y value
>> is within the tolerance of 2's Y, so that should be combined as well.
>> 8 is not combined because neither the X or Y value is within the
>> tolerance of any X or Y in the set that was combined.
>
> Trying to get my head around this a bit more. Are columns a/b/c/d
> treated as a big category (eg type, brand, category, model), such that
> nothing will ever be grouped that has any difference in those four
> columns? If so, we can effectively ignore them and pretend we have a
> table with exactly one set (eg stick a WHERE clause onto the query
> that stipulates their values). Then what you have is this:
>
> * Aggregate based on proximity of x and y
> * Emit results derived from e
>
> Is that correct?

There will be multiple groups of a/b/c/d. I simplified the query for
the purposes of posting my question. There is a where clause with
values that come from user input. None, any, or all of a, b, c, or d
could be in the where clause.

> So here's my way of writing it.
>
> * Subselect: List all values for x, in order, and figure out which
> ones are less than the previous value plus one
> * Subselect: Ditto, for y.
> * Outer select: Somehow do an either-or group. I'm not quite sure how
> to do that part, actually!
>
> A PGSQL window function would cover the two subselects - at least, I'm
> fairly sure it would. I can't quite get the whole thing, though; I can
> get a true/false flag that says whether it's near to the previous one
> (that's easy), and creating a grouping column value should be possible
> from that but I'm not sure how.
>
> But an either-or grouping is a bit trickier. The best I can think of
> is to collect all the y values for each group of x values, and then if
> any two groups 'overlap' (ie have points within 1.0 of each other),
> merge the groups. That's going to be seriously tricky to do in SQL, I
> think, so you may have to go back to Python on that one.
>
> My analysis suggests that, whatever happens, you're going to need
> every single y value somewhere. So it's probably not worth trying to
> do any grouping/aggregation in SQL, since you need to further analyze
> all the individual data points. I can't think of any way better than
> just leafing through the whole table (either in Python or in a stored
> procedure - if you can run your script on the same computer that's
> running the database, I'd do that, otherwise consider a stored
> procedure to reduce network transfers) and building up mappings.
>
> Of course, "I can't think of a way" does not equate to "There is no
> way". There may be some magic trick that I didn't think of, or some
> arcane incantation that gets what you want. Who knows? If you can
> produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!
>
> ChrisA
>
> [1] http://wiki.postgresql.org/wiki/Mandelbrot_set

Thanks for the reply. I'm going to take a stab at removing the group
by and doing it all in python. It doesn't look too hard, but I don't
know how it will perform.
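
For the single-column step described in the quoted text (flag whether each value is near the previous one, then derive a grouping column from that flag), a minimal Python sketch follows. The function name gap_group_ids and the strict "less than tol" comparison are assumptions made for illustration, not something settled in the thread. It only handles one column; the either-or combination of x and y is not covered here.

```python
from itertools import accumulate


def gap_group_ids(values, tol=1.0):
    """Pair each value (in sorted order) with a group number.

    Hypothetical helper: a value joins the previous value's group when the
    difference between them is less than tol (assumed strict comparison).
    """
    ordered = sorted(values)
    # True where a value starts a new group, i.e. it is not near its predecessor.
    starts_new = [True] + [
        cur - prev >= tol for prev, cur in zip(ordered, ordered[1:])
    ]
    # A running count of the "starts a new group" flags is the group number.
    group_numbers = accumulate(int(flag) for flag in starts_new)
    return list(zip(ordered, group_numbers))


# The x values from the example rows in the original question.
xs = [1.5, 2.4, 3.3, 4.2, 5.1, 6.0, 8.0, 100]
for value, group in gap_group_ids(xs):
    print(value, group)
```

The running count is what turns the true/false flag into a grouping column: every value that is not near its predecessor bumps the counter, and everything up to the next bump shares a group number.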

== 10 of 10 ==
Date: Sun, Jan 12 2014 10:09 pm
From: Chris Angelico

On Mon, Jan 13, 2014 at 2:35 PM, Larry Martell <larry.martell@gmail.com> wrote:
> Thanks for the reply. I'm going to take a stab at removing the group
> by and doing it all in python. It doesn't look too hard, but I don't
> know how it will perform.

Well, if you can't switch to PostgreSQL or such, then doing it in
Python is your only option. There are such things as GiST and GIN
indexes that might be able to do some of this magic, but I don't think
MySQL has anything even remotely like what you're looking for.

So ultimately, you're going to have to do your filtering on the
database, and then all the aggregation in Python. And it's going to be
somewhat complicated code, too. Best I can think of is this, as
partial pseudo-code:

last_x = -999
x_map = []; y_map = {}
merge_me = []
for x,y,e in (SELECT x,y,e FROM t WHERE whatever ORDER BY x):
    if x<last_x+1:
        x_map[-1].append((y,e))
    else:
        x_map.append([(y,e)])
    last_x=x
    if y in y_map:
        merge_me.append((y_map[y], x_map[-1]))
    y_map[y]=x_map[-1]

# At this point, you have x_map which is a list of lists, each one
# being one group, and y_map which maps a y value to its x_map list.

last_y = -999
for y in sorted(y_map.keys()):
    if y<last_y+1:
        merge_me.append((y_map[y], last_x_map))
    last_y=y
    last_x_map=y_map[y]

for merge1,merge2 in merge_me:
    merge1.extend(merge2)
    merge2[:]=[] # Empty out the list

for lst in x_map:
    if not lst: continue # been emptied out, ignore it
    do aggregate stats, get sum(lst) and whatever else

I think this should be linear complexity overall, but there may be a
few aspects of it that are quadratic. It's a tad messy though, and
completely untested. But that's an algorithmic start. The idea is that
lists get collected based on x proximity, and then lists get merged
based on y proximity. That is, if you have (1.0, 10.1), (1.5, 2.3),
(3.0, 11.0), (3.2, 15.2), they'll all be treated as a single
aggregation unit. If that's not what you want, I'm not sure how to
handle it.
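
As a runnable illustration of the same idea (collect by x proximity, then merge by y proximity), here is a minimal sketch that swaps the list-merging step for a union-find structure, so that chained merges cannot lose rows. This is a different technique from the pseudo-code above, not a tested version of it. The name group_rows, the strict "less than tol" comparison, and the placeholder e values are all assumptions; the rows are assumed to be already fetched for a single a/b/c/d group.

```python
def group_rows(rows, tol=1.0):
    """Group (x, y, e) rows that chain together through x or y proximity.

    Hypothetical helper: two rows are linked when their x values or their
    y values differ by less than tol; groups are the connected components.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # For each coordinate, sort the row indexes by that coordinate and link
    # sorted neighbours that are close enough. In one dimension, linking
    # neighbours is enough to capture every chain of nearby values.
    for axis in (0, 1):
        order = sorted(range(len(rows)), key=lambda i: rows[i][axis])
        for prev, cur in zip(order, order[1:]):
            if rows[cur][axis] - rows[prev][axis] < tol:
                union(prev, cur)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())


# x and y come from the example in the original question; e is a placeholder.
rows = [
    (1.5, 9.5, 1), (2.4, 20.8, 1), (3.3, 40.6, 1), (4.2, 2.5, 1),
    (5.1, 10.1, 1), (6.0, 7.9, 1), (8.0, 21.0, 1), (100, 200, 1),
]
for group in group_rows(rows):
    # Aggregate whatever is needed per group, e.g. the row count and sum of e.
    print(len(group), sum(e for _, _, e in group))
```

On the example rows this puts rows 1 through 7 in one group and row 8 in its own. The two sorts dominate the cost, so the sketch is O(n log n) rather than strictly linear.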

ChrisA

==============================================================================
TOPIC: python query on firebug extention
http://groups.google.com/group/comp.lang.python/t/ed10a892ce2c4afa?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 6:10 am
From: JAI PRAKASH SINGH

Hello,

I am working with the selenium module for Python. I know how to load
the Firebug extension with selenium, but I want to know how to use the
Firebug extension with the requests module or mechanize. I have
searched a lot but could not find anything. Please help.

I want a technique similar to this:

from selenium import webdriver

fp = webdriver.FirefoxProfile()

fp.add_extension(extension='firebug-1.8.4.xpi')
fp.set_preference("extensions.firebug.currentVersion", "1.8.4")
browser = webdriver.Firefox(firefox_profile=fp)

but for the requests module or the mechanize module.

==============================================================================
TOPIC: Python: 404 Error when trying to login a webpage by using 'urllib' and
'HTTPCookieProcessor'
http://groups.google.com/group/comp.lang.python/t/bf0e28e020a02c69?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Jan 12 2014 12:51 pm
From: Terry Reedy

On 1/12/2014 7:17 AM, KMeans Algorithm wrote:

> But I get a "404" error (Not Found). The page "https://www.mysite.com/loginpage" does exist

Firefox tells me the same thing. If that is a phony address, you should
have said so.

--
Terry Jan Reedy

==============================================================================



Monday, January 13, 2014
