Wikipedia talk:Database reports
To help centralize discussions and keep related topics together, talk pages for some individual database reports are either redirected here or show a notice directing future conversations here.
Index
This page has archives. Sections older than 180 days may be auto-archived by Lowercase sigmabot III if there are more than 4.
Requests: Please list any requests for reports below in a new section. Be as specific as possible, including how often you would like the report run.
Database reports/Long stubs - request - filter to remove articles with less than 250 words "readable prose"
Greetings. Recently I was advised to stop removing stub tags from articles with a "Page size" of less than 250 words. Before doing "undo self" edits to clean up the (January 2025) mess that I made, I updated the guidance at WikiProject Stub improvement What_is_a_stub? to clarify stub size and help others avoid making this same error. Now I am asking that HaleBot exclude those same small-prose-size articles from the list. Regards, JoeNMLC (talk) 18:13, 19 February 2025 (UTC)
- I assume determining readable page size of an article requires some sophisticated analysis possibly beyond the capabilities of the bot. But it does appear that about half or more of the articles in this report are legit stubs per the definition you cite. ~Kvng (talk) 17:48, 24 February 2025 (UTC)
- @Kvng - wondering if any of the existing code for Gadget Prosesize can be incorporated into the bot? Yes, that would be a major task but would greatly improve accuracy of "Long stubs" report. Cheers, JoeNMLC (talk) 17:55, 24 February 2025 (UTC)
- How about a simpler solution: Create a hidden category, maybe Category:Long stubs with short prose. We apply that manually to legit stubs per the definition. We then request to exclude articles in this category from the report. ~Kvng (talk) 17:06, 3 March 2025 (UTC)
- @Kvng - Yes, that would work for the bot to exclude so articles are not repeatedly included. Currently about 80-90 percent on the weekly report should not be there & it is a big-waste-of-time skipping those. When one of the articles are expanded with more prose, how would people know to remove Category:Long stubs with short prose? I do know about the setting to show all hidden categories at bottom of articles. Would this change (if done) be included into "Tech News"? JoeNMLC (talk) 19:00, 3 March 2025 (UTC)
- We could put description and a link to the category on the Wikipedia:WikiProject Stub improvement page. Someone would have to go through them periodically and review articles that have been significantly improved. Not so different from reviewing articles listed in the report. I'm not sure anyone needs to show hidden categories. The key pieces are excluding articles in this hidden category from the report and the automatically generated listing of the members of the hidden category (Category:Long stubs with short prose). ~Kvng (talk) 19:18, 3 March 2025 (UTC)
- A similar solution would be to make a configuration page linking to article titles that should be excluded from the report. This is just as easy, and completely sidesteps the issues with changing the articles themselves. (Directly excluding pages with x amount of readable prose isn't possible in a pure database report; it could conceivably be done by the bot running the regular report then fetching the text of each page on it, but that's significantly more work.) —Cryptic 19:21, 3 March 2025 (UTC)
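The configuration-page idea described here amounts to a set difference between the report rows and a list of excluded titles. A minimal sketch, with hypothetical placeholder data rather than the live report or configuration page:

```python
# Sketch: exclude a configured list of titles from report rows.
# The rows and exclusion list below are made-up placeholders.
def filter_report(rows, excluded_titles):
    """Drop any (title, size) row whose title is on the exclusion list."""
    excluded = set(excluded_titles)
    return [row for row in rows if row[0] not in excluded]

rows = [("Alpha", 2500), ("Beta", 3100), ("Gamma", 2801)]
excluded = ["Beta"]
print(filter_report(rows, excluded))  # [('Alpha', 2500), ('Gamma', 2801)]
```

The bot would read the exclusion list from the configuration page before writing the report; the filtering step itself stays this simple.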
- I think it would be better to use Category:Long stubs with short prose than to have to search and update a separate list of known long stubs. ~Kvng (talk) 19:48, 3 March 2025 (UTC)
- I've created the category and added 9 articles to it so the report generation can be tested.
- I gather we need to modify Wikipedia:Database reports/Long stubs/Configuration. I don't know SQL but ChatGPT recommends adding an AND NOT EXISTS clause to the existing:
- ~Kvng (talk) 19:59, 3 March 2025 (UTC)
SELECT
  page_title,
  page_len
FROM
  page
  JOIN categorylinks ON cl_from = page_id
WHERE
  cl_to LIKE '%stubs'
  AND page_namespace = 0
  AND page_len > 2000
  AND NOT EXISTS (
    SELECT 1
    FROM categorylinks AS cl_exclude
    WHERE cl_exclude.cl_from = page.page_id
      AND cl_exclude.cl_to = 'Long_stubs_with_short_prose'
  )
GROUP BY
  page_title
ORDER BY
  page_len DESC
LIMIT
  1000;
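The NOT EXISTS exclusion pattern can be sanity-checked locally with an in-memory SQLite database. This is a toy schema with simplified columns, not the live replica schema:

```python
import sqlite3

# Toy versions of the page and categorylinks tables, just enough to
# exercise the NOT EXISTS exclusion clause.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (page_id INTEGER, page_title TEXT, page_len INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 'Kept_stub', 2500), (2, 'Excluded_stub', 3000);
INSERT INTO categorylinks VALUES
  (1, 'History_stubs'),
  (2, 'History_stubs'),
  (2, 'Long_stubs_with_short_prose');
""")
rows = con.execute("""
SELECT page_title FROM page
JOIN categorylinks ON cl_from = page_id
WHERE cl_to LIKE '%stubs'
  AND page_len > 2000
  AND NOT EXISTS (
    SELECT 1 FROM categorylinks AS cl_exclude
    WHERE cl_exclude.cl_from = page.page_id
      AND cl_exclude.cl_to = 'Long_stubs_with_short_prose'
  )
GROUP BY page_title
""").fetchall()
print(rows)  # [('Kept_stub',)]
```

The page tagged with the exclusion category drops out of the result even though it is also in a "stubs" category, which is exactly the behavior wanted for the report.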
- @Kvng - I concur with above. During my past working years I did have a few "close-encounters" with SQL but have no knowledge of WP database, coding, testing, etc. Maybe help from Page watcher here, or the bot operator? Would be great if can be done before Wednesday's weekly processing. Cheers, JoeNMLC (talk) 21:51, 3 March 2025 (UTC)
- In the meantime, it's probably safe and productive to load up Category:Long stubs with short prose ~Kvng (talk) 22:10, 3 March 2025 (UTC)
Just an FYI: I will begin at #500 of the 1,000 and work back to #1... if not tonight, tomorrow morning. Cheers! JoeNMLC (talk) 22:44, 3 March 2025 (UTC)
- Progress: completed articles 10 to 100. JoeNMLC (talk) 22:57, 4 March 2025 (UTC)
- I have successfully constructed the query with Petscan. This is probably better than the report since it can be updated at will. ~Kvng (talk) 18:53, 7 March 2025 (UTC)
- Greetings Kvng - Today I ran the above Petscan query and am happy with the results. I reduced "Size" from 18000 to 15000, so far in 1000 decrements (first time using that word). Petscan appears to be a more powerful tool, and changes are "real-time". Cheers! JoeNMLC (talk) 20:50, 14 July 2025 (UTC)
- There is now an "Update the table now" link on the report page. This does the same query as using Petscan but makes the results available to everyone. Both take about the same amount of time (over a minute) to complete. ~Kvng (talk) 15:41, 17 July 2025 (UTC)
Losing battle
While I appreciate the effort, I think maintaining Category:Long stubs with short prose is a losing battle. You have to populate it and then remove articles as they get expanded. It's a decent amount of toil. Just count the words per article, it's not that difficult, computers are good at counting. :-) Yes, for annoying and stupid reasons you can't get the word count with SQL alone, but you can use a programming language to iterate through the list of stubs and extract a rough word count. Then you could either have the report exclude based on a word count threshold or you could include a sortable column with the word count. --MZMcBride (talk) 04:23, 6 March 2025 (UTC)
Here's a very basic script. There are likely better or smarter ways to do this, but this shows the general idea:
#!/usr/bin/env python3
import re

import requests
from bs4 import BeautifulSoup

urls = [
    'https://en.wikipedia.org/wiki/1999_Shetland_Islands_Council_election',
    'https://en.wikipedia.org/wiki/England_Open',
]

for url in urls:
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    text = ''
    for p in soup.find_all('p'):
        # Skip the stub template's boilerplate sentence.
        if p.text.find('You can help Wikipedia by expanding it.') == -1:
            text += p.text.strip() + ' '
    # Strip footnote markers such as [1].
    text = re.sub(r'\[\d+\]', '', text)
    print(url)
    print(text)
    print('Word count: {}'.format(len(re.findall(r'\w+', text))))
    print()
And then the output is:
$ ./venv/bin/python ./wiki_word_count.py
https://en.wikipedia.org/wiki/1999_Shetland_Islands_Council_election
Lewis Shand Smith
Independent Tom Stove
Independent Elections to the Shetland Islands Council were held on 6 May 1999 as part of Scottish local elections. The Liberal Democrats won 9 seats, the party's best result in a Shetland Islands Council election. Nine seats were uncontested.
Word count: 46
https://en.wikipedia.org/wiki/England_Open
The England Open is a darts tournament that has been held annually since 1995.
Word count: 14
This is for 1999 Shetland Islands Council election and England Open. This approach is not perfect, of course, but it's a decent approximation. --MZMcBride (talk) 04:55, 6 March 2025 (UTC)
- @MZMcBride, thanks for the suggestion. The word count we're looking for needs to match what Wikipedia:Prosesize reports. This excludes references, lists and tables, and I'm not sure what else. We could try to borrow source code from there or reverse engineer exactly what it is doing, but that all seems like a large project requiring ongoing maintenance and producing relatively small reward. I'm not sure who's qualified and willing to take this on. The Category:Long stubs with short prose solution discussed above has support and we've started using it, and it appears likely to meet our needs. I just need someone to show me how to test the SQL changes I've proposed above. ~Kvng (talk) 15:20, 6 March 2025 (UTC)
- Observations: - The current "Long stubs" wikitable is 80-90 percent incorrect. Would it be simpler/easier to disable that HaleBot task for now? Then make a single-function bot to read through the 2.3 million stubs and 1. find articles with prose-size over 250 words; 2. output to a plain wikitable (no need to sort by size) just the first 1,000 articles. While the Category:Long stubs with short prose approach may be a short-term fix, it is very time consuming, treats the symptom and does not solve the actual problem. (just my opinion). Regards, JoeNMLC (talk) 16:12, 6 March 2025 (UTC)
- I disagree with the implied assertion that the report is not valuable and I oppose the suggestion to disable the task that generates it. Editors have already worked the top portion of the list. There are more than 10-20 percent actionable articles in the lower half of the report. It would be nice if the report took readable prose into account but I disagree that the Category:Long stubs with short prose approach is very time consuming. I think it is beneficial to have actual eyes on some of these marginal stubs as, in general, there's no mechanical formula for assessments. ~Kvng (talk) 16:43, 6 March 2025 (UTC)
- Kvng, you can see at MediaWiki:Gadget-Prosesize.js#L-139 it just uses the api of toolforge:prosesize. — Qwerfjkltalk 12:19, 27 September 2025 (UTC)
- Ok, we could write a bot to post-process the query results and remove everything that is actually short. We now have 8,767 articles in Category:Long stubs with short prose so if we wanted to use post processing instead of Category:Long stubs with short prose to exclude the long stubs, we'd have to have the report output ~10,000 results and that would have to grow as we worked through this. Are reports that large feasible? The post processing would have to use toolforge:prosesize thousands of times for post processing. ~Kvng (talk) 22:28, 27 September 2025 (UTC)
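A post-processing pass of the kind described here would have roughly this shape. The word-count lookup below is a placeholder (a real version would query the prosesize tool or parse the rendered page for each title), and the 250-word threshold follows the stub definition discussed above:

```python
# Sketch: post-process report results, dropping entries whose readable
# prose is below the stub threshold. FAKE_COUNTS is made-up stand-in
# data; the real lookup would be a network call per title.
FAKE_COUNTS = {"Short one": 120, "Real long stub": 400}

def prose_word_count(title):
    # Placeholder: substitute a real readable-prose word count here.
    return FAKE_COUNTS.get(title, 0)

def postprocess(titles, limit=250):
    """Keep only titles whose prose word count meets the threshold."""
    return [t for t in titles if prose_word_count(t) >= limit]

print(postprocess(["Short one", "Real long stub"]))  # ['Real long stub']
```

The cost concern raised above shows up in `prose_word_count`: with ~10,000 candidate rows, that placeholder becomes thousands of slow external lookups per report run unless results are cached between runs.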
- Amended observations: - Because other editors have completed some of the 1,000 articles, I am repeating the same work already done. Keep the bot running, just add a tracking system for editors to communicate to others what parts of the list are completed. See the example below for details. JoeNMLC (talk) 15:06, 7 March 2025 (UTC)
WP Stub improvement progress
Below is the "Announcement panel" of a progress tracker for the weekly HaleBot report Long stubs. It identifies articles as "Open", "In process", and "Done". Note that when added to the bot report, this will be deleted with each new report. Should we ask if the bot can output a new panel?
WikiProject Stub improvement – Long stubs Progress
Instructions: Un-comment the Open line below to activate the In progress line for articles to check. When completed, please change the In progress line to Done.
- Sorry this report doesn't conform to your suggested format. I've checked 1-575 in the 5 March report and all improperly marked stubs have been assessed. There are still some legitimate stubs in the middle of this range that have not yet been put into Category:Long stubs with short prose.
- I suggest we move this discussion to Wikipedia_talk:WikiProject_Stub_improvement. ~Kvng (talk) 15:44, 7 March 2025 (UTC)
- As of now, I'm up to #180 for adding Category:Long stubs with short prose, not doing assessing. JoeNMLC (talk) 15:56, 7 March 2025 (UTC)
- Have you considered editing the report to remove pages that have been processed? Use an appropriate edit summary, and keep the table's basic format intact. That way, editors who visit the report will not have to repeat work. When the bot runs again, it will replace the page contents with an updated table. – Jonesey95 (talk) 16:38, 7 March 2025 (UTC)
- Thanks for the suggestion. The table uses class="wikitable sortable static-row-numbers static-row-header-text", so the table source doesn't have the item numbers in it, which makes it difficult to find things. If things are removed, everything gets renumbered. Might be better to have the bot refresh this report more often (at least until we're done working on the backlog).
- Most important, I think, is improving the selection criteria for the report to omit entries in Category:Long stubs with short prose. Who can help me learn how to test my proposed changes? ~Kvng (talk) 18:38, 7 March 2025 (UTC)
- The item numbers won't matter if the table is kept up to date by removing processed items. As for refreshing the report more often, if you can get access to the SQL that is used to generate the report, I can convert the page to use {{database report}}. – Jonesey95 (talk) 20:35, 7 March 2025 (UTC)
- I've got Petscan set up (see above #Losing battle) and this seems to remove the need for more frequent updates or report editing. The current SQL is at Wikipedia:Database_reports/Long_stubs/Configuration. My proposed improvement is also above #Losing battle. ~Kvng (talk) 23:19, 7 March 2025 (UTC)
- I have updated Wikipedia:Database reports/Long stubs so that it will run every day and can be updated manually with the link at the top of the page (it takes a few minutes to run). Further improvements to the report SQL can be made on the page (please test in Quarry first). Updates to the page's display can be made using the parameters documented at {{database report}}. – Jonesey95 (talk) 23:46, 7 March 2025 (UTC)
- @Jonesey95 - Thank you for these changes, i.e. daily wikitable update excluding articles with Category:Long stubs with short prose. This is most helpful! Yesterday I tagged about 10 or so articles with that cat. & today they are not within the wikitable, so I would conclude: IT'S Working. Cheers! JoeNMLC (talk) 15:23, 9 March 2025 (UTC)
Missing results
As we work our way through the report, each time the report is updated, it seems new long stubs are found. I don't see anything in the history of these newly-reported articles that indicates there were recent changes that caused them to meet our long stub criteria. I guess there's something non-obvious about how this SQL query works. I was expecting it was reporting the 1000 largest stubs in the encyclopedia but that does not seem to be the case. ~Kvng (talk) 21:04, 28 May 2025 (UTC)
- @Kvng - Considering there are 2.3 million stub articles, it's possible for the query to just begin plowing through from some arbitrary starting point. Each day within Category:Unassessed articles, some of the articles I assess as "Stub-class" do appear on the following day's report. For example: Category:Unassessed AfC articles, Category:Unassessed awards articles, and Category:Unassessed Athletics articles, from all of which I frequently assess a few. If an article already has a stub rating on the article page, it can still be "Unassessed" on the Talk page. So it's possible the query skips those unassessed articles and does not consider them for the report until after assessment is completed. It may ignore article stub tags. Cheers, JoeNMLC (talk) 23:45, 28 May 2025 (UTC)
- No, that's not, in fact, possible. The query will list the titles of the thousand longest, as measured in bytes, pages that are in mainspace and in at least one category whose name ends in "stubs". (It won't list pages of length 2000 bytes or less.) It doesn't start in a random place, and it has nothing whatsoever to do with the categorization of the talk page. The limited spot checking I just did of newly-appearing pages has, in every case, a corresponding edit adding bytes. Which pages are claimed to show otherwise? —Cryptic 23:57, 28 May 2025 (UTC)
- @Cryptic, I reviewed the articles I rated today and found that there is, after all, an explanation in every case. Sorry for the false alarm.
- If this query can be done more efficiently, that would be appreciated. It does sometimes time out when manually kicked off. ~Kvng (talk) 01:24, 29 May 2025 (UTC)
- How about Black Reel Award for Outstanding Screenplay, Adapted or Original, from 21 March 2025? JoeNMLC (talk) 00:18, 29 May 2025 (UTC)
- It wasn't in a category ending in "stubs" before then. It was in a category ending in "stubs" after that, because you put it in one. Aside from anything else, the query is fairly bonkers. It does work; and it may have been reasonably optimal for what it does when it was written, but it sure isn't now - getting a list of applicable categories from the category table first, and then looking at everything in them for length, is about six times faster than looking at all pages for length and only then culling for categorization. And just doing it with the single category Category:All stub articles should be equivalent (assuming the stub templates are well-formed - those that aren't need to be fixed anyway) and while it's only marginally faster than that, it simplifies the query greatly. —Cryptic 00:44, 29 May 2025 (UTC)
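The single-category simplification described here replaces the `LIKE '%stubs'` scan over every category name with one equality test against Category:All stub articles. A toy demonstration, again with in-memory SQLite and simplified columns rather than the live replica schema:

```python
import sqlite3

# Toy schema: one page is tagged via All_stub_articles, one is not
# categorized as a stub at all.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (page_id INTEGER, page_title TEXT, page_len INTEGER);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 'Tagged_stub', 2500), (2, 'Untagged_page', 9000);
INSERT INTO categorylinks VALUES (1, 'All_stub_articles');
""")
rows = con.execute("""
SELECT page_title, page_len FROM page
JOIN categorylinks ON cl_from = page_id
WHERE cl_to = 'All_stub_articles'  -- single equality, no LIKE scan
  AND page_len > 2000
ORDER BY page_len DESC
""").fetchall()
print(rows)  # [('Tagged_stub', 2500)]
```

Because each well-formed stub template adds its page to Category:All stub articles, the equality test also removes the need for GROUP BY deduplication across multiple matching stub categories.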
- (As an aside to my aside, here's a list of all categories whose name ends in "stubs" that currently contain at least one mainspace page that's not in Category:All stub articles. —Cryptic 01:07, 29 May 2025 (UTC))
- @Cryptic - Yes, that quarry:query/94073 is an interesting list. I checked the first 2 articles (Faucett F-19 and Atmaprakash (novel)) to find both have a "category stub" instead of a correct "template-stub". So I updated both of them. The list is a useful tool to cleanup those articles. Thanks. JoeNMLC (talk) 13:49, 29 May 2025 (UTC)
Opinions needed
The linked miscapitalizations report is cluttered with tens of thousands of piped links that don't necessarily need to be fixed, but these obscure the hundreds of non-piped (visible) links that do. Changing the query/bot to omit these is near-impossible, so what other choices do we have? I suggest just clearing them all out using AWB in one go, but that seems to be a controversial idea so opinions are needed. Electricmemory (talk) 23:25, 27 March 2025 (UTC)
- Clearing them all out without examining them case-by-case is probably not the right approach. Some articles, like À Beira do Caminho, appear on that list because of (IMO) improper page moves. Just setting AWB to replace all of the links will reinforce those improper moves and make more work when the move is reverted. There are only 755 articles on the list currently, down from 1,000+ one month ago; I suggest working your way down the list and fixing the pages that are actual errors and that are not piped links, then reviewing the list to see what is left. – Jonesey95 (talk) 01:32, 1 April 2025 (UTC)
- I've done a bit of work with templates that transclude these miscapitalized links, and if we exclude the huge batch of "Census" links for now, there are just 100 pages with 100 or more links. If you check each article to make sure it has been properly moved, then set your AWB settings to look for and replace only unpiped links, you should be able to cruise through those pages pretty quickly. – Jonesey95 (talk) 17:52, 1 April 2025 (UTC)
- That's better, but IMO still doesn't fix the problem very well... Electricmemory (talk) 02:45, 15 April 2025 (UTC)
- Setting AWB to highlight only non-piped links doesn't stop it from digging through all of the pages anyway; that's just more of a waste of time. Electricmemory (talk) 02:50, 15 April 2025 (UTC)
- @Jonesey95 the Census links are a huge source of the problem, though. Even if I were to go and fix all the non-piped links in those right now, I would still have to dig through them all again in the future to find the proverbial needle in the haystack. I want to demolish the haystack once and stop it from forming again, if that makes sense. Electricmemory (talk) 02:48, 15 April 2025 (UTC)
- You'll probably need a BRFA then. – Jonesey95 (talk) 13:37, 15 April 2025 (UTC)
Request
[edit]As there are certain maintenance categories (often template-transcluded) that do not need to be cleaned up for removal of draftspace or userspace content, the Wikipedia:Database reports/Polluted categories and Wikipedia:Database reports/Polluted categories (2) reports are supposed to be excluding categories that have been tagged with {{Polluted category}} — but for a long while now they haven't been doing that, so both of those reports remain permanently cluttered up with maintenance categories that don't need to be addressed, such as Category:Miscellany to be merged, Category:Wikipedia Student Program or Category:Temporary maintenance holdings/RFD.
So could these reports please be updated to recognize and skip categories with {{Polluted category}} on them again like they used to? Thanks. Bearcat (talk) 22:04, 10 April 2025 (UTC)
Request for a report: Internet Archive Book Scans.
A list of works or resources hosted on archive.org, linked to from English Wikipedia, including information as to the existence of a local copy of the resource on Commons. ShakespeareFan00 (talk) 19:19, 22 April 2025 (UTC)
Withdrawn for now. ShakespeareFan00 (talk) 06:49, 26 April 2025 (UTC)
Request: Links to cited works, hosted externally with scanned copies available.
A list of citations and external links in articles, where the link is to a scanned version of a work (such as a PDF or DjVu file). The report could be further split up by the date of publication given in the citation or footnote. The list of included sites is to be determined (such as Internet Archive, Google Books and HathiTrust). ShakespeareFan00 (talk) 11:54, 26 April 2025 (UTC)
"WikiProjects by changes" presentation seems backwards
I have been using {{Database report}} to produce useful reports in a WikiProject for a long while, and this has really propelled this WikiProject up the list. But this was never my intention. Also, the idea that bots aren't excluded from the main ranking doesn't seem to do what the report is designed for - to demonstrate user activity within a WikiProject. My suggestion here is to make "Edits excl. bots" the primary ranking column, and "Edits incl. bots" a column one can alternatively sort by. Thank you for your consideration. Stefen 𝕋owers among the rest! Gab • Gruntwerk 03:55, 14 June 2025 (UTC)
- Legoktm, Dbeef: Since you both operate HaleBot, which updates this report, might you have any thoughts on this? I'm not sure of the purpose of including bot edits in the main sort for "WikiProjects by changes". Am I missing something? Stefen 𝕋owers among the rest! Gab • Gruntwerk 22:00, 7 July 2025 (UTC)
- I will assume the radio silence means the way the report works is intentional. So projects adding automated reports will just zoom up the list. Stefen 𝕋owers among the rest! Gab • Gruntwerk 22:30, 26 July 2025 (UTC)
- To follow up, I'm working on something to implement my request. I have figured out the query for showing active WikiProjects in the past year by human edits, which I think is what matters most in showing project interest. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 13:23, 21 August 2025 (UTC)
Done. I have created two new reports: "WikiProjects by human changes" and "WikiProjects with no activity". This is effectively a split of "WikiProjects by changes" to get the most out of SQL performance enhancements, and I've added a couple new useful columns to the first report. Enjoy. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 02:49, 23 August 2025 (UTC)
- To scratch a technical itch and provide "one-stop shopping" to see all the WikiProject statuses on one page, I have updated the "WikiProjects by human changes" report by adding a Status column. I also removed a few false positives from the results. Enjoy! Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 05:08, 24 August 2025 (UTC)
- Another update: I revised the SQL in "WikiProjects by human changes" to 1) Reduce selects for greater efficiency; 2) have a smarter determination of Status (affects just a few results); and 3) do a tertiary sort by WikiProject name (provides more stability in lower-ranked results from run to run). Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 11:00, 25 August 2025 (UTC)
Reports possibly breaking due to rc_new and rc_type going away this week
[edit]See Wikipedia:Village pump (technical)#Tech News: 2025-35. The rc_new and rc_type fields in the recentchanges table are getting wiped out sometime this week. This may affect some of the reports. I only looked at the WikiProject reports, and "New WikiProjects" will need a minor update. I could replace it with a Database report template but if the bot maintainer would rather fix it, that would be fine. The report will likely break on its next run, August 31. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 10:46, 26 August 2025 (UTC)
- You might also want to watch Category:SDZeroBot database report failures, if you aren't already. —Cryptic 11:38, 26 August 2025 (UTC)
- Thanks for the info. I will watch that. That just helps with reports generated by {{Database report}}, right? Most of the ones on this page seem to be generated differently. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 11:45, 26 August 2025 (UTC)
- That's correct. —Cryptic 11:49, 26 August 2025 (UTC)
- The "New WikiProjects" report is now replaced and working, and shall now run weekly rather than fortnightly, LOL. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 04:22, 7 September 2025 (UTC)
New WikiProject request
Greetings admin, I would like to request the creation of a new WikiProject for our Mentor Me! programme, where new editors on the English Wikipedia get coached on creating new articles and developing them. This programme is going to be a continuous one, as we have plans to continue it next year and beyond on a global scale. Thanks and warm regards, Kambai Akau (talk) 21:36, 24 September 2025 (UTC)
- Please see Wikipedia:WikiProject#Creating and maintaining a project. The page you're currently asking on here doesn't handle the creation of WikiProjects - it just contains a few reports about them. Stefen 𝕋ower's got the power!!1! Gab • Gruntwerk 21:48, 24 September 2025 (UTC)
- Okay, thanks @StefenTower. Kambai Akau (talk) 21:42, 25 September 2025 (UTC)