How to Speed up a Python Program 114,000 times


Before getting into how to speed up a Python program, it helps to recall what Python is: a prominent, object-oriented programming language created by Guido van Rossum.

Its statement structure (syntax) is simple and easy to use, which makes it a good fit for people who are new to programming.

Making a serious data collection program run 114,000 times faster is one thing; optimization in general is quite another.

The speedups:

  • Hoist invariant code and precompute values:

The original code had duplicate facts, magic numbers, and poor structure. Two weeks of cleanup revealed the underlying structure and made it possible to precompute four columns. Duplicate facts are now centralized, and the result is faster and more readable (2-5x speedup).
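As a rough illustration (the loop body, the `rates` table, and the numbers below are invented, not taken from the original program), hoisting loop-invariant work and precomputing a derived value looks like this:

```python
import math

# Before: the rate lookup and the log() call are loop-invariant but run once per row.
def total_before(rows, rates):
    total = 0.0
    for row in rows:
        rate = rates["standard"]      # same lookup every iteration
        scale = math.log(1000.0)      # constant expression recomputed every iteration
        total += row["amount"] * rate / scale
    return total

# After: hoist the invariants out of the loop and precompute the combined factor once.
def total_after(rows, rates):
    factor = rates["standard"] / math.log(1000.0)
    return sum(row["amount"] * factor for row in rows)

rows = [{"amount": 10.0}, {"amount": 20.0}]
rates = {"standard": 0.07}
assert math.isclose(total_before(rows, rates), total_after(rows, rates))
```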

  • Introduce pipeline parallelism:

Split the program into a Fetcher, an Analyzer, and a Writer. The pipeline is faster because RDBMS I/O overlaps with compute, each stage is a simpler program, and re-runs are quicker (roughly 2x speedup, less over time).
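A minimal sketch of such a pipeline, assuming a design built on multiprocessing queues; `fetch_rows()` and `analyze()` are placeholders for the real database query and compute step:

```python
import multiprocessing as mp

def fetch_rows():
    # Placeholder for batched reads from the RDBMS (e.g. cursor.fetchmany() loops).
    for start in range(0, 100, 10):
        yield list(range(start, start + 10))

def analyze(batch):
    # Placeholder for the real compute step.
    return [x * x for x in batch]

def fetcher(out_q):
    for batch in fetch_rows():
        out_q.put(batch)
    out_q.put(None)                        # sentinel: no more data

def analyzer(in_q, out_q):
    while (batch := in_q.get()) is not None:
        out_q.put(analyze(batch))          # compute overlaps the next fetch
    out_q.put(None)

def writer(in_q):
    with open("results.txt", "w") as f:
        while (batch := in_q.get()) is not None:
            f.write(" ".join(map(str, batch)) + "\n")

if __name__ == "__main__":
    q1, q2 = mp.Queue(maxsize=8), mp.Queue(maxsize=8)
    stages = [mp.Process(target=fetcher, args=(q1,)),
              mp.Process(target=analyzer, args=(q1, q2)),
              mp.Process(target=writer, args=(q2,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```

Each stage stays a simple program, and the queues let the database fetch for the next batch proceed while the previous batch is still being analyzed and written.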

  • Use numpy effectively:

Use NumPy for speed. Convert the database result set from a mix of strings, floats, and ints to all floats, using a lookup table for the strings. Replacing arrays of PyObject pointers with plain float arrays lets NumPy run at full speed (8x speedup).
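A small sketch of the conversion with an invented three-column layout; the point is that encoding the strings as float codes up front keeps every column numeric, so NumPy never falls back to slow object arrays:

```python
import numpy as np

rows = [("red",   3, 1.25),      # (category, quantity, price) from the result set
        ("green", 7, 0.50),
        ("red",   2, 2.75)]

# Lookup table: map each distinct string to a float code the first time it is seen.
codes = {}
def encode(s):
    return codes.setdefault(s, float(len(codes)))

data = np.array([(encode(cat), float(qty), price) for cat, qty, price in rows],
                dtype=np.float64)

# Every column is now float64, so this runs at full NumPy speed.
revenue = data[:, 1] * data[:, 2]
print(data.dtype, revenue)
```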

  • Parallelize with multiprocessing:

Use multiprocessing. Switch from one process to number-of-cores processes (not n processes)!

Use fork() without exec(), or use shared memory (near-linear speedup).
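A sketch of the idea, assuming a Linux-style fork start method and using `multiprocessing.Array` as the shared buffer; the row layout and the per-chunk work are invented:

```python
import multiprocessing as mp
import numpy as np

N_ROWS = 1_000_000
shared = mp.Array("d", N_ROWS, lock=False)          # shared float64 buffer
big_data = np.frombuffer(shared, dtype=np.float64)  # NumPy view, no copy
big_data[:] = np.arange(N_ROWS, dtype=np.float64)

def chunk_sum(bounds):
    lo, hi = bounds
    # With fork (no exec), workers inherit big_data; nothing is pickled or copied.
    return float(big_data[lo:hi].sum())

if __name__ == "__main__":
    n = mp.cpu_count()                               # one worker per core, not one per task
    step = N_ROWS // n
    chunks = [(i * step, N_ROWS if i == n - 1 else (i + 1) * step) for i in range(n)]
    with mp.Pool(processes=n) as pool:
        total = sum(pool.map(chunk_sum, chunks))
    print(total)
```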

  • Eliminate copying of big data:

First, reduce the row width from 84 to 36 bytes. Then eliminate all copying of the big data, and favor sequential writes over random reads.

Take advantage of the fact that row order is unimportant: use a radix sort to count how many times each input row should appear in the synthetic data set.
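The counting idea can be sketched with NumPy's `bincount` and `repeat` standing in for the hand-rolled radix sort; the sizes and the stand-in rows are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_input_rows, n_output_rows = 1_000, 10_000

# Draw which input row each synthetic row comes from...
draws = rng.integers(0, n_input_rows, size=n_output_rows)

# ...then collapse the draws into per-row repeat counts instead of sorting or
# shuffling the big data itself.
counts = np.bincount(draws, minlength=n_input_rows)

# Because row order is unimportant, each input row is emitted counts[i] times
# in one sequential pass, with no copying or random reads of the big data.
input_rows = np.arange(n_input_rows)          # stand-in for the real rows
synthetic = np.repeat(input_rows, counts)
print(len(synthetic), counts[:5])
```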

  • Reduce time spent on random number generation (RNG):

Reducing user time exposed system time as the next target. Tracing showed too many calls into the random module.
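A sketch of the fix, replacing per-value calls into `random` with one bulk NumPy call:

```python
import random
import numpy as np

N = 1_000_000

# Before: one Python-level call per value, traced as millions of calls into random.
slow = [random.random() for _ in range(N)]

# After: a single vectorized call produces the whole block at once.
rng = np.random.default_rng()
fast = rng.random(N)

print(len(slow), fast.shape)
```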

  • Touch the big data once by swapping loops:

Do 500 passes over the big data with only one sequential read by swapping the middle and inner loops (500 / number-of-cores passes per core).
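A toy version of the loop interchange; the per-pass work here is an invented weighted sum, but the access pattern is the point: the naive order re-reads the big data once per pass, while the swapped order reads each block once and applies every pass to it while it is hot in cache:

```python
import numpy as np

big_data = np.arange(1_000_000, dtype=np.float64)
weights = np.linspace(0.5, 1.5, 500)               # one weight per pass
BLOCK = 65_536

def passes_outer():
    totals = np.zeros(len(weights))
    for p, w in enumerate(weights):                # passes outer...
        totals[p] = (big_data * w).sum()           # ...big data re-read 500 times
    return totals

def data_outer():
    totals = np.zeros(len(weights))
    for start in range(0, len(big_data), BLOCK):   # big data outer: one sequential read
        chunk_sum = big_data[start:start + BLOCK].sum()
        totals += weights * chunk_sum              # all 500 passes on the hot chunk
    return totals

assert np.allclose(passes_outer(), data_outer())
```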

  • Use Cython and hand-optimize the C code:

Hand-optimize 62 lines of C code.

Permute the column-summing order to help the L1D cache.

Cythonize the compute kernel.
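A minimal Cython sketch of what a typed compute kernel can look like (this is not the original 62-line kernel; the column-sum operation and types are assumptions). Build it with `cythonize -i kernel.pyx` and call it from Python:

```cython
# cython: boundscheck=False, wraparound=False
# kernel.pyx -- illustrative Cythonized kernel, not the original code.
import numpy as np

def column_sums(double[:, ::1] data):
    """Sum each column of a C-contiguous float64 array using typed C loops."""
    cdef Py_ssize_t n_rows = data.shape[0]
    cdef Py_ssize_t n_cols = data.shape[1]
    cdef double[::1] out = np.zeros(n_cols)
    cdef Py_ssize_t i, j
    for i in range(n_rows):            # row-major walk keeps the L1D cache warm
        for j in range(n_cols):
            out[j] += data[i, j]
    return np.asarray(out)
```

From Python: `import kernel; kernel.column_sums(np.ascontiguousarray(my_array))`, where `my_array` stands for any 2-D float64 array.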

Future speedups (potential):

  • Use faster hardware: more cores, more cache, more GHz
  • Replace bit-valued byte columns with one bit-masked column to cut the row width from 36 to 30 bytes
  • Use CPU vector instructions
  • Rewrite the compute kernel in assembler
  • Use Linux API calls to bind RAM allocation by socket
  • Port to GPU/LRB using a GPU library, then primitives
  • Clusterize

Fast code is cheap code: understand and unleash your machine, because it is incredibly fast. Speed up your code first.

Clusterize as a last resort. If your business is real-time, your software should be too. Make sure your tools are working properly.

Do the coding; it is not boring.

Harness the power of Python!
