Using pywikipediabot[1]

Pywikipediabot at Wikiversity[2]
Python Regular Expressions[3]

Regex Tutorials: #1, #2
ÜberBot run by Sean Colombo
Janitor run by Teknomunk
BotUm run by Aquatiki
R2-D2 run by Unaiaia
SandBot run by Redxx
S2E2 run by EchoSierra
The Notorious B.O.T. run by team a
Man-Machine run by HS
UmatBot run by Umat
Botanic run by Chris
00101010 run by 6×9
Lyra Botstrings run by Bobogoobo

Live Bot List
Bot family picture

Overview


This is a page where those with the knowledge can share regex, code, etc.

Please leave related messages and requests for help on the Bot Portal talk page.

Process To Make a New Bot

Getting a bot account

Having a bot 'flag' on your account simply means that your edits won't show up in the Recent Changes list by default. This is to reduce clutter, since bots tend to make a very large number of edits.

Once you've finished writing your bot, the general process we follow (à la Wikipedia) is to run it for about a day while it is still a "normal" user account, at a relatively slow pace (maybe one request per minute), so that the whole community can see the changes in the Recent Changes list. This gives a lot of eyeballs a chance to catch the occasional bug (we've all been known to write those from time to time ;)). Once everything looks good, leave a message for Sean or another Bureaucrat and we'll give your account a bot flag.

Please limit server queries to one every 1-2 seconds, and page changes to one every 10-20 seconds, depending on server load. With replace.py, which is part of PyWikipediaBot, this can be done by adding the arguments -sleep:2 -pt:20. In custom scripts based on PyWikipediaBot, you need to call wikipedia.handleArgs() to allow the use of -pt:, or set the delay with wikipedia.put_throttle.setDelay(20, absolute = True). Alternatively, you can edit config.py and change the put_throttle option to alter the behavior globally.
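For the global route, the throttle defaults live in the framework's config.py and can be overridden in user-config.py. A sketch of the settings involved (values here are illustrative; check the config.py shipped with your version for the exact option names):

```python
# user-config.py (pywikipedia "compat" framework) -- illustrative values
put_throttle = 20   # minimum seconds between page writes
minthrottle = 2     # minimum seconds between read requests
```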

Code

Please put suggestions, snippets and code (that has been tried and tested) in this section. Especially welcome would be examples of code commonly used on wikis, particularly here at LyricWiki, that others may find of use. Do not take anything for granted - what might be obvious to one, may not be so to others. Thanks!

Pywikipedia settings

Settings in user-config.py:

 family = 'lyricwiki'
 mylang = 'en'
 usernames['lyricwiki']['en'] = u'Username'

Changes to families/lyricwiki_family.py:

 class Family(family.Family):
     def __init__(self):
         family.Family.__init__(self)
         self.name = 'lyricwiki'
         self.langs = {
             'en': 'lyrics.wikia.com',
         }
 
         self.namespaces[4] = {
             '_default': [u'LyricWiki', self.namespaces[4]['_default']],
         }
         self.namespaces[5] = {
             '_default': [u'LyricWiki talk', self.namespaces[5]['_default']],
         }
 
     def version(self, code):
         return "1.14.0"
 
     def scriptpath(self, code):
         return ''
 
     def path(self, code):
         return '/index.php'
 
     def apipath(self, code):
         return '/api.php'
 
     def disambcategory(self, code):
         return "Category:Disambiguation_Page"

Fixing Broken NOTOC and NOEDITSECTION

Originally posted by Redxx @22:39, 12 October 2008 (UTC):

python replace.py -start:Frank_Sinatra -sleep:1 -pt:10 "  NOTOC  " "__NOTOC__" 
python replace.py -start:Frank_Sinatra -sleep:1 -pt:10 "  NOEDITSECTION  " "__NOEDITSECTION__" 

I successfully used the above to restore the underscores that had been accidentally stripped from either side of NOTOC and NOEDITSECTION on the Frank Sinatra album pages. Although I decided not to do it this way (and therefore did not test this), I believe the regex equivalent is:

python replace.py -start:Frank_Sinatra -regex "\s{2}NO(TOC|EDITSECTION)\s{2}" "__NO\1__"
And it might be safer to just replace: ([A-Z]{3,11}) by (TOC|SECTION). --Mischko 07:56, 18 October 2008 (UTC)
Thanks Mischko, I have updated example.  Яєdxx Actions Words 00:55, 7 August 2009 (UTC)
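Before pointing a pattern like this at live pages, it can be sanity-checked locally with Python's re module (the sample wikitext below is made up):

```python
import re

# Two stray spaces on each side where the underscores used to be;
# the alternation covers both NOTOC and NOEDITSECTION.
pattern = re.compile(r"\s{2}NO(TOC|EDITSECTION)\s{2}")

sample = "  NOTOC  \n'''Some Album (2008)'''\n  NOEDITSECTION  \n"
fixed = pattern.sub(r"__NO\1__", sample)
print(fixed)
```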

Fixing Song Page Ranking

Originally posted by team a @07:15, 25 October 2008 (UTC):

This command is designed to convert all song pages in Category:Review Me to pages with Green stars, as per LyricWiki:Page ranking. The pages are removed from the category, and "star=Green" is added to the page's {{Song}} template.

Basic Form

This is the stripped-down version, to explain the basics. It assumes that Category:Review Me appears at the top of the page, above {{Song}}, and that the page does not already have a star of any kind. It captures everything from {{Song up to, but not including, the closing }} of the template, appends "|star=Green}}", and puts the result back into the page.

replace.py -regex "\[\[Category:Review[\s_]Me\]\]\s*(\{\{Song\|[^|]*\|[^}|]*)\}\}" "\1|star=Green}}"

Limitations: Doesn't work with songs that have featured artists, with songs that aren't in Category:Review Me but are missing a star, or with songs that are in Category:Review Me despite already having a star.

Advanced Form

This is the form that I actually use. It can deal with featured artists and with Green and Black stars (if there are any), and includes some redundancies ([^|{}]) in case the {{Song}} template is broken, i.e. missing end brackets. It also deals with songs that aren't in Category:Review Me (and I've found some). This should help fix some user errors, as well as confusion about the Page Ranking policy.

replace.py -regex "(?:\[\[Category:Review[\s_]Me\]\])?\s*(\{\{Song (?:\|[^|{}]*){2}(?:\|fa\d?=[^{}|]*)*)(\|star=(Black|Green))?\}\}" "\1|star=Green}}"

Limitations: Can't deal with Category:Review Me if it occurs after the song template (which it shouldn't, as it was added to the top automatically), and can't remove pages with other star ratings from Category:Review Me.
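The advanced pattern can likewise be checked locally with Python's re module before a live run (the sample wikitext is made up; note that the category link is part of the match, so the substitution removes it, as intended):

```python
import re

pattern = re.compile(
    r"(?:\[\[Category:Review[\s_]Me\]\])?\s*"
    r"(\{\{Song (?:\|[^|{}]*){2}(?:\|fa\d?=[^{}|]*)*)"
    r"(\|star=(Black|Green))?\}\}"
)

sample = "[[Category:Review Me]]\n{{Song |Artist Name|Song Title}}"
result = pattern.sub(r"\g<1>|star=Green}}", sample)
print(result)
```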

Thanks very much to Aquatiki and Senvaikis for their help with regex.

See the development of this topic in Notorious' Archive, but please leave comments/corrections/suggestions here, on the Bot Portal talk page. Thanks. -team a

Move Genre

Originally posted by team a @07:48, 3 November 2008 (UTC):

This pywikipedia command edits all pages in one genre, removing them from that genre and adding them to a second genre instead. I'm posting it here as an example of how an unescaped pipe (| rather than \|) can combine what would otherwise be multiple regex invocations. It can handle both artist and album pages.

General Form

replace.py -sleep:2 -pt:20 -regex -cat:Genre/Hip-Hop "(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)FIRST GENRE" "\1SECOND GENRE"

Example

Move all pages in Category:Genre/Hip-Hop to Category:Genre/Hip Hop:

replace.py -sleep:2 -pt:20 -regex -cat:Genre/Hip-Hop "(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)Hip-Hop" "\1Hip Hop"
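A quick local check of the alternation with Python's re module (made-up wikitext covering all three alternatives):

```python
import re

# One alternation handles |genre=, |genre2= and the category link at once
pattern = re.compile(r"(\|\s*[Gg]enre\s*=\s*|\|\s*genre2\s*=\s*|\[\[Category:Genre/)Hip-Hop")

sample = "|genre = Hip-Hop\n|genre2 = Hip-Hop\n[[Category:Genre/Hip-Hop]]"
result = pattern.sub(r"\g<1>Hip Hop", sample)
print(result)
```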

Set language in SongFooter

Originally posted by Hs @16:33, 10 January 2009 (UTC):

Regular expressions (in Python) for three cases: The language key exists and is empty, the language key exists and is filled, or no language key exists.

_RE_LANG_EMPTY = re.compile(r'(\{\{\s*SongFooter(?:\s+\|.*?)*?\|\s*language\s*=\s*?)((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_LANG_EXISTENT = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?\|\s*language\s*=\s*).*?((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)
_RE_NO_LANG = re.compile(r'(\{\{\s*SongFooter\s+(?:\|.*?)*?)((?:[\r\n]+\s*)?\}\})', re.DOTALL)

And the corresponding function to set the language. Set force to True to overwrite an existing language value.

   def _setLanguage(self, text, language, force=False):
       """Set the language in the SongFooter template. Set force to True to
       override an already set language. Returns None on error.
       """
       count = 0
       if force:
           (text, count) = _RE_LANG_EXISTENT.subn("\\1%s\\2" % language, text)
       else:
           (text, count) = _RE_LANG_EMPTY.subn("\\1%s\\2" % language, text)
       if not count:
           (text, count) = _RE_NO_LANG.subn("\\1\n|language = %s\\2" % language, text)
       if count:
           return text
       else:
           return None
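As a quick local check, the empty-language pattern can be exercised on a made-up SongFooter (the regex is copied from above):

```python
import re

_RE_LANG_EMPTY = re.compile(
    r'(\{\{\s*SongFooter(?:\s+\|.*?)*?\|\s*language\s*=\s*?)'
    r'((?:[\r\n]+\s*)?(?:\|.*?)*?\}\})', re.DOTALL)

text = "{{SongFooter\n|fLetter = S\n|language = \n}}"
new_text, count = _RE_LANG_EMPTY.subn(r"\g<1>English\g<2>", text)
print(new_text)
```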

TitleCase function

Originally posted by Hs @16:38, 10 January 2009 (UTC):

Regex to match the beginning of a word

_RE_TOUPPER = re.compile(r'(^|\s|[\"\(\[])(\w)', re.UNICODE)

And the corresponding function. name must be a Unicode object, so that uppercasing works for non-ASCII characters.

   def TitleCase(name):
       return _RE_TOUPPER.sub(lambda match: match.group(1) + match.group(2).upper(), name)
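A small check of the function (under Python 3, strings are already Unicode, so the note above is satisfied automatically; under Python 2, pass a unicode object):

```python
import re

_RE_TOUPPER = re.compile(r'(^|\s|[\"\(\[])(\w)', re.UNICODE)

def TitleCase(name):
    return _RE_TOUPPER.sub(lambda m: m.group(1) + m.group(2).upper(), name)

print(TitleCase('the "best of" (live) [demo]'))
```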

Asynchronous writes in pywikipedia

Originally posted by Hs @21:20, 11 February 2009 (UTC):

pywikipedia has the method Page.put_async(text), which allows writing to a page in the background. To limit the size of the queue of pages waiting to be written, put something like this at the beginning of your code:

import wikipedia
import Queue
# ...
wikipedia.page_put_queue = Queue.Queue(5)

This way, given a sensible put throttle (e.g. 20 seconds), you get slight concurrency aligned with the writing speed.

Orphaned pages for one artist

#!/usr/bin/python
#
# Find orphaned song pages on lyrics.wikia.com by comparing the list of pages
# with prefix <Artist>: with what is actually linked on page <Artist>.
#
# This will not work for artist pages that have been split onto several pages,
# like Rolling_Stones. Also not all pages need to be linked, e.g. translations.

import wikipedia
from wikipedia import Page
from pagegenerators import PrefixingPageGenerator

def usage():
    print("""
Usage: ./orphans.py Artist
""")


def orphans(artist):
    prefix = artist + ":"
    allPages = PrefixingPageGenerator(prefix, includeredirects = False)
    allPages = set(map(lambda p: p.title(), allPages))
    site = wikipedia.getSite()
    artistPage = Page(site, artist)
    linkedPages = artistPage.linkedPages()
    linkedPages = map(lambda p: p.title(), linkedPages)
    linkedPages = filter(lambda s: s.startswith(prefix), linkedPages)
    linkedPages = set(linkedPages)
    orphanedPages = list(allPages.difference(linkedPages))
    orphanedPages.sort()
    print("\n".join(orphanedPages))

def main():
    argv = wikipedia.handleArgs()
    if len(argv) != 1:
        usage()
        return
    orphans(unicode(argv[0])) # hopefully this is correct

if __name__ == "__main__":
    try:
        main()
    finally:
        wikipedia.stopme()

--Hfs·· 22:37, June 19, 2010 (UTC)


I changed the code for output…
    print("\n".join(orphanedPages))
…to…
    for songPage in orphanedPages:
        s = songPage.encode("Latin-1")
        print("# '''[[" + s + "|" + s[len(artist)+1:] + "]]'''")
This creates a list I can paste directly into the OS section (after de-wrapping the occasional long line) and fixes problems with non-ASCII chars. You might have to change "Latin-1" to whatever charset your terminal uses.
CAVEAT: I don't know the first thing about Python and arrived at the above code by googling. It might be horrendously wrong and/or an insult to all things Python for all I know. Works for me though.6×9 (Talk) 18:06, December 9, 2011 (UTC)
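For reference, the slice in that print line simply drops the Artist: prefix so the link gets a short label (titles below are made up):

```python
artist = "Artist"
song_page = "Artist:Song Two"

# len(artist) + 1 skips the artist name plus the colon separator
line = "# '''[[" + song_page + "|" + song_page[len(artist) + 1:] + "]]'''"
print(line)
```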

New version that uses the "Songs by ARTIST" category and isn't confused by song subpages, artist aliases, or multiple artists of the same name; also switched to file output. 6×9 (Talk) 10:01, November 23, 2013 (UTC)

import wikipedia
import catlib
from wikipedia import Page
from pagegenerators import CategorizedPageGenerator

(...)

def orphans(artist):
    site = wikipedia.getSite()
    allPages = CategorizedPageGenerator(catlib.Category(site, "Category:Songs by " + artist))
    allPages = set(map(lambda p: p.title(), allPages))
    linkedPages = Page(site, artist).linkedPages()
    linkedPages = set(map(lambda p: p.title(), linkedPages))
    orphanedPages = list(allPages.difference(linkedPages))
    orphanedPages.sort()
    outfile = open('C:\\orphans.txt', 'a')
    outfile.write("***** " + artist + " *****\n")
    for songPage in orphanedPages:
        s = songPage.encode("UTF-8")
        outfile.write("# '''[[" + s + "|" + s[len(artist)+1:] + "]]'''\n")
    outfile.close()

(...)

Other Languages

Perl

  • TODO: Make Sean's framework and tutorial available

PHP

Python

  • See PyWikipediaBot above.