README_TWPC.txt
author Colas Nahaboo <colas@nahaboo.net>
Wed Sep 24 13:24:27 2008 +0200
changeset 205 cce9b833fbe9
parent 200ab21999cd501
---+ TWPC, aka the TWiki Public Cache
Colas Nahaboo http://colas.nahaboo.net
This readme is an "under the hood" working document. The official page is at
http://twiki.org/cgi-bin/view/TWiki/PublicCacheAddOn
Mercurial repo: http://hg.colas.nahaboo.net/twiki-colas/twpc
SVN via http://develop.twiki.org/~twiki4/cgi-bin/view/Bugs/Item5551

Files:
   * pccr Cache Reader, shell version. Slower
   * pccr.c C version, for speed
   * pcbd Cache Builder, called by pccr on cache misses
   * pccl Cache Cleaner, run by crontab to clear cache after edits
   * pcad ADmin commands, web based
   * pcal Log analyzer to determine the best twpc settings from past usage
     (not yet written)
   * pcge script to build all pages, called by pcad
   * PublicCacheAddOn.txt user/admin documentation as a TWiki page
   * PublicCachePlugin.pm PublicCachePlugin.txt Perl module to trigger cache
     invalidation on topic change
   * README_TWPC.txt this file, internal dev info
   * install uninstall make-pc-config: installation management
   * make-distrib make-hg-revision: build system for dev
Files generated by install:
   * vief is a copy of the original TWiki bin/view, used to build cached pages.
     bin/view is replaced by pccr.
     A backup is made in pc-view-backup, just in case...
   * pccr.bin
   * pc-config a "compilation" of lib/LocalSite.cfg settings
   * pc-options keeps track of the options last used on install

Cache files:
   * the cache resides in working/public_cache/cache
   * inside, there is one folder per web, with the same name
   * and files named after the topic, with extensions:
      * .tx    uncompressed plain version (including CGI HTTP header)
      * .gz    same, compressed (including CGI HTTP header)
      * .nc    nocache: do not attempt to cache it
      * .lk    lock file: the cache is being (re)built by a process
   * at cache root, directory _tmp holds temporary files used to build caches,
     named process_id + extension:
      * .raw   raw output of TWiki, then uncompressed cache
      * .mod   modified output
      * .gz    compressed cache
   * at cache root, directory _changers contains the IPs of editors (one file
     per IP, named as the IP). Each file has the modification time of the
     last edit
   * at cache root, directory _expire contains empty web/topic files whose
     date indicates the time at which the cache for this page should be
     removed by pccl (these files thus have a date in the near future)
   * cache clear is done by moving cache to cache.a_number, and removing it
     30 seconds later, to avoid the race conditions and errors that removing
     a directory from under the feet of build processes could cause

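Two of the mechanisms above, the delayed cache clear and the _expire markers,
can be sketched in shell as follows. This is only an illustration, not the
code of pccl or pcad: the function names, their arguments, the mktemp-based
"cache.XXXXXX" naming and the assumption that _expire lives inside the cache
directory are all hypothetical.

```shell
#!/bin/sh
# Sketch only; function names, arguments and the "cache.XXXXXX" naming are
# assumptions made for illustration, not taken from pccl or pcad.

# Clear the whole cache: move it aside, recreate it empty, delete it later.
# $1 = public_cache directory, $2 = grace delay in seconds (default 30)
twpc_clear_cache() {
    root=$1
    delay=${2:-30}
    [ -d "$root/cache" ] || return 0
    # unique holding directory next to the cache, so mv stays a cheap rename
    old=$(mktemp -d "$root/cache.XXXXXX") || return 1
    mv "$root/cache" "$old/cache" || return 1
    mkdir "$root/cache"
    # builders still writing into the moved tree are not disturbed; the old
    # tree is only deleted once the grace delay has passed
    ( sleep "$delay"; rm -rf "$old" ) &
}

# Drop the cached pages whose _expire marker has reached its date.
# $1 = cache directory (holding the per-web folders and _expire)
twpc_expire_sweep() {
    cache=$1
    [ -d "$cache/_expire" ] || return 0
    now=$(mktemp) || return 1       # reference file stamped with "now"
    # markers that are not newer than the reference file are due
    find "$cache/_expire" -type f ! -newer "$now" | while read -r marker; do
        page=${marker#"$cache/_expire/"}    # web/topic
        rm -f "$cache/$page.tx" "$cache/$page.gz" "$marker"
    done
    rm -f "$now"
}
```

A crontab job could call twpc_expire_sweep on working/public_cache/cache,
while twpc_clear_cache would be given working/public_cache itself.
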
Log files:
   * in the same dir as the twiki log files (data/)
   * if -q was not given, cache hits are logged in the normal twiki logs
     with user agent "cached,gzip" or "cached"
   * twpc-debug.txt logs lots of misc info for debugging, only if -v was
     specified on install
   * twpc-warnings.txt logs abnormal, but not fatal, conditions:
     * LOCK_TIMEOUT pcbd waited too long and decided to break the lock
     * LOCK_MISSING some race condition occurred
     * NOT_BUILT_ERR building attempt resulted in an error other than
       access denied

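The LOCK_TIMEOUT condition can be pictured with a small shell sketch. It
assumes a hypothetical helper, twpc_wait_lock, that polls a .lk file once per
second and breaks it after a maximum wait; the polling interval, the default
timeout and the format of the warning line are assumptions, not pcbd's actual
behaviour.

```shell
#!/bin/sh
# Sketch only: wait for a .lk file to disappear, break it after a timeout.
# The function name, its arguments and the log line format are assumptions.
# $1 = lock file, $2 = warnings log, $3 = max wait in seconds (default 30)
twpc_wait_lock() {
    lock=$1
    warnings=$2
    max=${3:-30}
    waited=0
    while [ -e "$lock" ]; do
        if [ "$waited" -ge "$max" ]; then
            # the builder holding the lock is presumed dead: log and break it
            echo "$(date '+%Y-%m-%d %H:%M:%S') LOCK_TIMEOUT $lock" >>"$warnings"
            rm -f "$lock"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0    # the lock was released within the allowed time
}
```

A return status of 1 tells the caller that the lock was broken, so it may
want to rebuild the page itself instead of trusting a half-built cache file.
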
In case of twpc update:
   * if view is pccr, we have a working twpc install
      * we copy all files
   * if there is no pccr file, or view is not pccr, we have a normal/updated
     twiki
      * we copy all files, mv view to vief, copy pccr to view

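The two update cases can be sketched as below. This is a guess at the logic,
not the install script itself: the function name twpc_install_view is
hypothetical, "view is pccr" is tested here with a byte comparison (cmp),
and refreshing view from the new pccr in the first case is an assumption.

```shell
#!/bin/sh
# Sketch only: decide between "update" and "first install" of the view script.
# The function name and the byte comparison used for "view is pccr" are
# assumptions; the real install script may detect this differently.
# $1 = TWiki bin directory, $2 = directory holding the new twpc files
twpc_install_view() {
    bin=$1
    src=$2
    if [ -f "$bin/pccr" ] && cmp -s "$bin/view" "$bin/pccr"; then
        # view is pccr: working twpc install, vief already holds the real view
        cp "$src/pccr" "$bin/pccr" || return 1
    else
        # plain or freshly upgraded TWiki: keep its view as vief first
        mv "$bin/view" "$bin/vief" || return 1
        cp "$src/pccr" "$bin/pccr" || return 1
    fi
    # in both cases view ends up being the current cache reader
    cp "$bin/pccr" "$bin/view"
}
```

Comparing before copying matters: once the new pccr has been copied, view
and pccr would differ, and a working install would look like a plain TWiki.
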
Debug messages tracing various steps in data/twpc-debug.txt (warning: this
list is obsolete):
   * HIT file: cache hit (HIT_GZ for gzipped)
   * BYPASS_QS url: cache ignored as we have a query string ?x=y in the url
   * BYPASS_NC url: cache ignored as the url was marked as not cacheable
        (protected?)
   * BUILT url: cache built for url
   * NOT_BUILT_ERR url: error in getting the URL, marking it as not cacheable
   * NOT_BUILT_AUTH url: URL read-protected, marking it as not cacheable
   * WAITED n url: waited n seconds for a previous build
   * MISS: cache miss, followed by either BUILT or NOT_BUILT
   * WAIT id n url: waits for lock for n seconds

ISSUES:
   * in a pccr web request, we may end up calling another url on the same
     TWiki via wget: we could thus deadlock the server if all the requests
     are stuck this way.
     Advise the user to raise the number of apache children. However, this
     should never happen in practice, and apache will time out eventually
     anyway.
   * the link in view to edit?t=%GMTIME{"$epoch"} would normally make the
     pages uncacheable (they would get dirty each second), but it appears
     that browsers do not cache as soon as there is a query string, so we
     do not bother to provide this functionality
   * install/update/uninstall clears the whole cache; we don't try to
     determine which pages really are dirty. Better safe than sorry.

TESTS:
Load test launching 1000 concurrent requests. With --compressed, curl asks
the server for a gzip-encoded response:

i=1000;while let 'i-->0';do curl --compressed -s http://wikidev.nahaboo.org/TWiki/TWikiVariables >/dev/null& done

PCCR ALGORITHM VERSIONS
   * v1 header is in file. Tries in order ?query, .gz, .tx, .nc
   * v2 when editing, our IP is marked as a "changer"
      * views from this IP bypass the cache
      * after a timeout "cleargrace" (default 17 mn) with no more edits from
        this IP, the cache is reset, if all other editors have also not
        edited for at least "cleargracemin" (default 3 mn)
   * v3 introduced the PCACHEEXPTIME TWiki tag
   * v4 used the PUBLIC_CACHE_EXPIRE TWiki var

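The v1 lookup order combined with the v2 changer check might look like the
shell sketch below. It is not the code of pccr: the function name, the
argument order, the position of the changer test and the BYPASS_CHANGER
label are assumptions. The other labels reuse the debug message names listed
in this readme, and _changers is assumed to sit inside the cache directory.

```shell
#!/bin/sh
# Sketch only: pick what to do for one request, v1 order plus the v2 changer
# check. Function name, argument order and the BYPASS_CHANGER label are
# assumptions; the other labels reuse the debug message names of this readme.
# $1 = cache dir, $2 = web, $3 = topic, $4 = query string ("" if none),
# $5 = "yes" if the client accepts gzip, $6 = client IP
pccr_lookup() {
    cache=$1
    web=$2
    topic=$3
    query=$4
    gzip_ok=$5
    ip=$6
    base="$cache/$web/$topic"
    if [ -n "$query" ]; then
        echo "BYPASS_QS"                 # query string: never served from cache
    elif [ -n "$ip" ] && [ -e "$cache/_changers/$ip" ]; then
        echo "BYPASS_CHANGER"            # recent editor: always sees live pages
    elif [ "$gzip_ok" = yes ] && [ -f "$base.gz" ]; then
        echo "HIT_GZ $base.gz"           # compressed copy, header included
    elif [ -f "$base.tx" ]; then
        echo "HIT $base.tx"              # plain copy, header included
    elif [ -e "$base.nc" ]; then
        echo "BYPASS_NC"                 # page marked as not cacheable
    else
        echo "MISS"                      # caller would hand over to pcbd
    fi
}
```

On HIT or HIT_GZ the caller only has to send the named file as is, since the
cached files already include the CGI HTTP header.
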
TODO:
   * can it be installed and manage the cache without being active?
   * see if other modules can store their cache in the twpc dir
   * can we trigger an external command on cache clear?
   * ? obey if-modified-since
   * should work on sites with .pl extensions
   * pcbd could cd to cache first, to avoid half-building things if a cache
     clear happens in mid-build
   * pcad clear should be callable from the cli
      * Plugin should use it directly, optionally use wget for mod_perl
      * scripts could call it to trigger a change (write) e.g. blog-generate
   * document how other modules/scripts could use the cache
   * pcge -v should not list private pages?
   * just after login we are redirected to vief
   * detailed stats:
      * logs, uncacheable pages, expires. Some terse stats moved in menu
      * stats menu then holds more detailed stats: stats per web
   * make-distrib should
      * commit in SVN
      * deploy Todo & Implementation .txt pages as wiki pages
   * pcal, log analysis
   * option -s space-efficient: only store the gzipped version, unzip on
     demand. For C, use zlib to inflate.
   * generational cache: if we know we are doomed, where to build new pages?
     In the new cache?
     A solution:
      * pccr: if a changer, use cache=cache_changers, including pcbd calls
      * plugin: on write, clear cache_changers, create a new one
      * on changers expire, clear cache, mv cache_changers as cache
   * pcad command to clean all cache pages older than ...
   * C version: make 2 versions, one checking for changer IP and one not;
     make PublicCachePlugin install the first, and cache clear the 2nd.
     Variant: change a byte in the executable binary

MAYBE TODO:
   * ? see if we can get the mime-type header from the View.pm patch instead
     of decoding it from wget
   * ? option to let logged-in people pass through the cache (how to detect
     them?)
   * ? put twpc files into a dir other than bin/? cgi/? (but what about view?)
   * ? optional expire header
   * ? background crawling process to add & refresh an expire header to the
       cached pages, for the "ok now the site is final" moment
   * ? make an apache-based pccr, with rewrite rules? see:
      * http://mail-archives.apache.org/mod_mbox/httpd-users/200701.mbox/%3C1C80FD8A7D2B2745B0396F4D2D0565B401AE4C6D@apwmsg01.alc.ca%3E
   * ? make a proper generic TrackChangesPlugin and use it: can call hooks,
       logs unix style: linenum isodate who action web.topic IP [attachment]
       list all actions (call to writeLog). convert script. per day?
   * ? check if we could force the cache in one language / localisation?
   * ? cache directive in html comments in pages? (to set Expire per page)