Template talk:redlink category

Expensive function error messages

@Barytonesis: This edit removed the expensive function error message in Appendix:Ancient Greek Swadesh list. Manually disabling module functions that are exceeding the limit on expensive functions is currently the only way to prevent module errors. I started a Phabricator request for a better way, but that may or may not go anywhere. — Eru·tuon 01:29, 30 October 2017 (UTC)

We should delete this.

This approach is the wrong way to solve the problem, and it causes more problems by virtue of its existence.

  1. By having the server do the heavy lifting of parsing and checking whether a title exists for a given language, we use up a finite resource (Lua memory/processing time) and needlessly add a ton of computation to the project; the computation is also repeated far more often than necessary. Since Lua memory is finite, and exhausting it causes pages to render incorrectly, this harms the user experience and creates extra work for editors, who must track down and refactor the pages that error out. Over time the problem will only grow as more pages hit the Lua memory limit.
  2. This does not need to be accomplished in real time. Using a module for this task is akin to going to the grocery store once for every ingredient as you are trying to cook a recipe. The better strategy is to go to the store periodically and save lots of time and resources. This task would be easy to run periodically, and in a way which could be more complete and require less effort.
  3. The results, while useful, are far from necessary. Even if this module accomplished something that could not otherwise be accomplished, that still wouldn't justify its existence: breaking page after page over time is more problematic than not having a convenient, language-specific list of red links ready to hand. Fortunately, there are other, easier ways to accomplish this task with less downside, so the point isn't terribly relevant.

I suggest that we get rid of this method of generating red-link categories, and instead move the idea to the toolserver with a group project that periodically (every time a dump is released, conservatively monthly) regenerates such lists without the need for exceptions. All words in all languages could easily be covered, without the need for ever-growing exception lists. If the project were done on the toolserver it could be maintained and operated by a group rather than an individual, so the problem of individual-point-of-failure is reduced. I am willing to help with the implementation, and I know for a fact that others have already implemented the same concept at least half a dozen times, so there are others who are capable of doing so as well. - TheDaveRoss 12:33, 14 March 2019 (UTC)

Sounds sensible, and I would be in favour of such a solution if and only if someone is willing to actually implement it - I personally don't have the technical know-how. As it stands, I find these categories really useful (especially the translation redlink categories, which allow me to weed out nonsensical neologisms in dead languages very easily), and if possible would very much like there to be a replacement if it is to be deleted. — Mnemosientje (t · c) 13:07, 14 March 2019 (UTC)
Agreed. The fact that it's limited to a few languages makes it less useful. I'm willing to help implement this. – Jberkel 14:41, 14 March 2019 (UTC)
Sounds right to me. Many maintenance categories are not needed in real time. DCDuring (talk) 16:01, 14 March 2019 (UTC)
Good idea. It all depends on whether someone wants to work on it, but I imagine that using the dump more interesting data could be generated, like most-redlinked entries in particular languages, and redlink checking could be extended to templates that don't currently have it like {{der}}, {{der3}}, {{alter}}. I'm interested in learning how this would be implemented; maybe I could contribute. — Eru·tuon 23:19, 18 March 2019 (UTC)
Having most-redlinked stats will be great for determining which new entries to create first. Unfortunately we don't have full HTML dumps (phab:T182351) yet, so if we work with XML dumps we need to add special parsing logic for some of the templates you mentioned, which means we won't get 100% link coverage, but it should be good enough. – Jberkel 11:41, 19 March 2019 (UTC)
Oh, this category is only used in {{m}}, {{l}} and {{t}}, so any static approach should deliver the same results without a lot of extra effort (I first assumed that the checking happened in the bowels of Module:links or similar). Jberkel 13:03, 20 March 2019 (UTC)
I've created a program for printing all instances of a particular template. I imagine it might simplify things to have lists of all instances of each linking template. The program could be modified to print each relevant template (and its redirects) to a separate file, or it could simply be run multiple times, since for me at least it takes less than 90 seconds even for a widely used template like {{m}}. [Edit: The program also comes with the bonus of detecting invalid template syntax.]
We could create custom logic to determine which parameters in each template indicate an entry name, or try to generate the output of the invoked module functions by replicating the module system and supplying fake frame objects to the functions. I imagine the first option would be simpler; at least, I don't know how easy replicating the module system would be. — Eru·tuon 23:26, 22 March 2019 (UTC)
I've taken the first approach, using a Wikitext parser. For this to work I had to convert the language data modules to JSON, and then reimplement Language:makeEntryName. The bulk of the links come just from a few templates, so doing parameter extraction is not too bad. Here are the first results, sorted by link count: User:Jberkel/lists/missing/20190320/all. Some results are interesting, but there are also many instances where the link target is simply wrong. – Jberkel 00:02, 23 March 2019 (UTC)
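In outline, a makeEntryName reimplementation of the kind described amounts to applying per-language replacement rules to the display text before looking a title up. A minimal Python sketch; the rule data here is illustrative, not the actual contents of the language data modules:

```python
import re
import unicodedata

# Illustrative per-language entry-name rules, mirroring the shape of the
# entry_name data in Module:languages (replacement pairs applied in order).
# These sample rules are assumptions, not the exported module data.
ENTRY_NAME_RULES = {
    # strip macrons from Latin vowels by decomposing and keeping the base letter
    "la": [("[āēīōūȳ]", lambda m: unicodedata.normalize("NFD", m.group(0))[0])],
    # strip Arabic short-vowel diacritics (harakat)
    "ar": [("[\u064B-\u0652\u0670]", "")],
}

def make_entry_name(lang_code, text):
    """Normalize display text to the page title it should link to."""
    for pattern, repl in ENTRY_NAME_RULES.get(lang_code, []):
        text = re.sub(pattern, repl, text)
    return text
```

The real data contains many more languages and rules; the point is only that the normalization is table-driven, so it can be ported once the tables are exported as JSON.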
For the incorrect links, like prepositional pronoun where the language code is gd, it looks like it would be useful to generate a list of all pages with such links so they can be corrected. — Eru·tuon 00:22, 23 March 2019 (UTC)
Yes, I want to do this, and split the output into several pages, maybe just one per language. With the orange links it's sometimes tricky to tell if the link is wrong or if the entry is missing. I'll see if I can put this on the toolserver as suggested above. – Jberkel 00:39, 23 March 2019 (UTC)
For Arabic and Hebrew, it would also be helpful to show the original link text (with vowel diacritics), because there could be several different words with the same letters but different diacritics. Perhaps also for other languages. Not sure how to make that work though. — Eru·tuon 02:01, 23 March 2019 (UTC)
That's doable. Right now everything is transformed to the canonical form because it is easier to aggregate. How many redlinks should we display per language? Just the most common ones? And maybe it's worth prioritizing redlinks from translation tables, or from entries in another language. What exactly is needed, and how are these categories currently used? @Erutuon, TheDaveRoss, DCDuring, Mnemosientje – Jberkel 07:47, 23 March 2019 (UTC)
@Jberkel Thanks for the sortable table. I started to use it to clean up some obvious errors in "en" like using "en" where "kn" would be right. Unfortunately, the sort order reverted to frequency rather than language, so the process takes longer than it should. How can I make the sort order and focus durable? DCDuring (talk) 13:47, 23 March 2019 (UTC)
Also, the left-pointing arrow provides all the links to a given entry, not just the problematic ones. DCDuring (talk) 13:59, 23 March 2019 (UTC)
@DCDuring: Yes, the arrow just uses the "what links here" page. To find the links you can also use insource search, e.g. insource:"{{m|de|Test}}". In the next run I'll include links back to the entries. – Jberkel 21:33, 25 March 2019 (UTC)
@Jberkel: Right. I thought you probably would press on in that direction, but that perhaps something else would come up. (You know about bright, shiny objects, I'm sure.) Thanks for doing this. DCDuring (talk) 00:01, 26 March 2019 (UTC)
Did the sort order change when you clicked a link and then went back? The JavaScript doesn't automatically re-sort the table to the way it was sorted before when you visit it again. That could probably be done, but would require some custom JavaScript. — Eru·tuon 20:20, 23 March 2019 (UTC)
@Erutuon: Exactly. It would be nice, and not just for this. But I don't know how many people value such a capability, nor how hard it is to code. For the task at hand, Jberkel says he intends to produce from the dump something more usable than the sortable table. I can wait for that. DCDuring (talk) 00:01, 26 March 2019 (UTC)
For the use-case mentioned by Mnemosientje, clearing out fictitious translations, I think all redlinks have to be accessible because it's very likely that the fictitious translation is just redlinked once. — Eru·tuon 19:32, 23 March 2019 (UTC)
@DCDuring, Erutuon There's a new list, User:Jberkel/lists/missing/20190401/all, this time with source entry links. It needs a bit more tweaking, there are still a couple of blue links in there. Once everything works I'll generate language specific lists. – Jberkel 03:05, 8 April 2019 (UTC)
@Jberkel: Very nice! (In my browser, it takes a while for the JavaScript to process the collapsible elements, so the page is unresponsive for several seconds, but that may be fixed when the list is broken up.) Not all the blue links are correct. MediaWiki:Gadget-OrangeLinks.js is not quite smart enough to tell that Category:Norwegian Bokmål lemmas in stille is not a Norwegian category, or that Category:Ancient Greek redlinks in παστός doesn't indicate that there's an Ancient Greek entry, for instance.
To show only the blue links (not red or orange), I entered $('.mw-parser-output tr').has('span a.new, span a.partlynew').hide() in my browser's JavaScript console. One, rock samphire, had the wrong header: Translingual instead of English. (I wonder if anyone checks for that type of situation.) Several of the blue links are Norwegian or Ancient Greek, which are incorrectly oranged by the gadget, but some are genuine (rimay has a Quechua entry, обући a Serbo-Croatian entry). I wonder what's going on with the latter. — Eru·tuon 03:59, 8 April 2019 (UTC)
@Erutuon: Ah, that's good to know, I haven't checked everything in detail yet. The problem with Quechua/Serbo-Croatian is likely related to entry normalization not yet behaving exactly like Module:languages#Language:makeEntryName, I'll look into it. – Jberkel 07:08, 8 April 2019 (UTC)
@Erutuon Bugs squashed and table regenerated; the Quechua/Serbo-Croatian entries are gone now. What about Norwegian links? When is {{m|no}} actually used? We have only around 1,900 pages with a "Norwegian" L2 (compared to 61K Bokmål, 42K Nynorsk). – Jberkel 19:37, 8 April 2019 (UTC)
@Jberkel: I don't really know about the Norwegians. (To me it seems like there should be one Norwegian with qualifiers for Bokmål and Nynorsk, but I don't make the rules.) Maybe Donnanz can explain. — Eru·tuon 19:42, 8 April 2019 (UTC)
@Erutuon: What exactly should I be looking at? DonnanZ (talk) 19:50, 8 April 2019 (UTC)
@Donnanz: Regarding my question earlier, when would you link to Norwegian instead of Bokmål/Nynorsk? Specifically, are the translation links from appease, quiet, mute etc. to stille wrong? – Jberkel 20:16, 8 April 2019 (UTC)
@Jberkel: The objective is to replace Norwegian with Bokmål and Nynorsk, except for surnames and given names which are impossible to separate. This has already been done to a large extent in the Norwegian entries. You are going to find thousands of entries like this as I have concentrated on the Norwegian entries, not the English ones; anyway I fixed quiet, but was unsure with appease and mute, so I removed Norwegian stille completely from those. DonnanZ (talk) 21:17, 8 April 2019 (UTC)
I forgot to say that Bokmål and Nynorsk entries may already exist, but links using {{t|no}} won't link directly to them, so they have to be checked carefully. DonnanZ (talk) 21:24, 8 April 2019 (UTC)
Thanks for the info. @Erutuon I've generated some language lists, overview page at User:Jberkel/lists/missing/20190401, taking the top 1000 requested entries for each. It seems to work quite well; the few blue links I spotted were entries created after the dump was generated. – Jberkel 21:53, 8 April 2019 (UTC)

@Jberkel: (edit conflict) The new language-specific pages that you're creating are still slow to load on my computer because of the collapsible elements. I think I'll try to make a faster collapsibility script that can be used instead of jQuery.makeCollapsible (mw-collapsible).

It would also be nice to put {{scripttaglinks}} around the lists of pages (in the Sources column) to display the links in decent fonts. — Eru·tuon 21:57, 8 April 2019 (UTC)

I created User:Erutuon/scripts/missingEntries.js, which works with the list format here. It's faster than jQuery.makeCollapsible, but it still takes a few seconds. — Eru·tuon 23:00, 8 April 2019 (UTC)

Template:User:Erutuon/missing entries.css (when {{scripttaglinks}} is used) ensures that commas between right-to-left terms in the "Sources" column display left-to-right, e.g. in نرگس, ܢܪܩܝܣ. — Eru·tuon 23:26, 8 April 2019 (UTC)

Maybe the best solution is to decrease page size, and implement proper pagination. Maybe this could be done with a module, which would generate the table. – Jberkel 06:37, 9 April 2019 (UTC)
Yeah, that would work. I suppose data could be stored in a plain text format and then the module could grab as much as it needed to. A format that uses tabs as separators has worked well for storing possibly incorrect headers and is easy to parse with Lua. The data currently displayed could be laid out with the following tab-separated fields: language code, title, entries that link to it. This would be briefer than a data module or JSON. — Eru·tuon 07:19, 9 April 2019 (UTC)
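A minimal sketch of that tab-separated layout and a parser for it. The sample rows and the use of a comma to sub-delimit the linking entries in the third field are assumptions, not the agreed format:

```python
# One redlink per line; fields = language code, page title, source entries.
# The "," separator inside the third field is an assumption for illustration.
SAMPLE = "grc\tκεράσιον\tcherry,Kirsche\nla\tlava\tlave\n"

def parse_missing(data):
    """Parse the tab-separated redlink data into (code, title, sources) rows."""
    rows = []
    for line in data.splitlines():
        code, title, sources = line.split("\t")
        rows.append((code, title, sources.split(",")))
    return rows
```

The same split logic is just as short in Lua (`mw.text.split` on tabs), which is what makes this format convenient for a display module to consume incrementally.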
Ok, I'll export the data in that format then. – Jberkel 07:23, 9 April 2019 (UTC)
Okay, I've created a module here that uses the data format seen here. — Eru·tuon 07:47, 9 April 2019 (UTC)
Data uploaded + pages updated. – Jberkel 12:48, 9 April 2019 (UTC)
Looks like there's no easy way to do pagination without creating (placeholder) pages. Still, with the module the output is a lot more manageable, and the concerns are nicely separated. There are a lot of pages to create, and I'm wondering if it's really worth it (who is going to page through 60k entries?). An alternative would be to just show the top lists here and provide the full datasets on Toolforge. – Jberkel 19:53, 9 April 2019 (UTC)
Perhaps pages could be created on request. But I think it would be a good idea to provide all the data somewhere, so that anyone can copy the data to create a list (rather than having to request it from you).
I imagine that weeding out fictitious translations would be easier with lists of redlinks in {{t}} and {{t+}}. I'm not sure if it would be useful to create separate lists for all templates though. — Eru·tuon 21:28, 9 April 2019 (UTC)
Ok, I'm currently trying to get the code run on toolforge. It's … slow. – Jberkel 21:42, 9 April 2019 (UTC)
Done: The whole thing now runs on Toolforge. It's a bit slower, but it only needs to run twice a month anyway.
@Erutuon Do you have an account there? I can add you to the project. The index page now contains links to the full data in JSON format. – Jberkel 20:52, 10 April 2019 (UTC)
@Jberkel: I signed up for a Wikimedia developer account (without requesting access to Toolforge), but now I can't figure out what my username and password were and don't know how to recover them, and creating new accounts is currently disabled. So I guess being added to the project is out for now. — Eru·tuon 20:57, 10 April 2019 (UTC)
I'm curious to see the code to find out if there are ways to optimize it. I wonder if my template extractor would be useful or not, or an optimized entry name generator. I'm mad at myself for not carefully remembering my username and password. Maybe I'll figure it out one of these days. — Eru·tuon 21:40, 10 April 2019 (UTC)
The code is on gitlab. It's written in Java/Python and uses Spark, which is probably overkill but works really well, even when used on a single machine the processing is distributed on all cores. The tradeoff is more overhead in shuffling/serializing data, but I haven't actually measured it. Parsing the markup also takes time, it builds a complete AST. I'm wondering if that tree could be serialized, so all future processing on it would be very fast. What's also really nice in terms of productivity is a SQL-like query interface in Spark which lets you do all sorts of aggregations on flat data files (you can see a snapshot of an interactive session here). – Jberkel 22:27, 10 April 2019 (UTC)
The parsing might be faster if the script were working from files that contain all the instances of the templates. My program can print all instances of {{l}}, {{m}}, {{t}}, {{t+}} each to a separate file with the titles of the pages on which they were found, in the format '\1' <title> '\n' (<template> '\n')+. (The \1 character can be replaced with something else if necessary.) The program takes under 2 minutes even to print a large number of different templates, like this list of form-of templates. The sizes of the files for each template were as follows: {{l}}, 62M; {{t}}, 29M; {{t+}}, 27M; {{m}}, 17M. Maybe adding the intermediate step of grabbing all instances of the templates and printing them in a usable format would speed things up.
The method of finding a template is rudimentary: it just looks for matching nested {{ and }}, where {{ must be followed by a non-empty string in which neither | nor }} matches. But I think that works for most pages in the main namespace, where fancy template syntax is rare. — Eru·tuon 23:22, 10 April 2019 (UTC)
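That rudimentary matching can be sketched as a stack-based scan. This version omits the check that the template name contains no | or }}, so it is slightly more permissive than the method described, but the nesting behavior is the same:

```python
def find_templates(text):
    """Collect all {{...}} spans, including nested ones, by recording each
    opening position on a stack and emitting a span whenever a }} closes
    the most recent {{. Not a full wikitext parser."""
    stack, found = [], []
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            stack.append(i)
            i += 2
        elif pair == "}}" and stack:
            start = stack.pop()
            found.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return found
```

Because inner templates are popped before outer ones, a nested case like {{bor|en|ja|{{ja-r|...}}}} yields both the inner {{ja-r}} and the whole {{bor}} span, which is exactly the case discussed below.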
Yes, I think an index is the way to go. We could generate something containing a mapping of all templates to pages (once), then we could run any type of query really efficiently. I'll try that for the next version. – Jberkel 09:35, 11 April 2019 (UTC)
It would also be great to extend this to more templates, especially the most common etymology templates ({{der}}, {{inh}}, {{bor}}, {{cog}}, {{noncog}}). That would require a Java version of getNonEtymological from Module:etymology to convert etymology language codes to regular language or language family codes. Other templates that do not accept etymology language codes would be easier to add. — Eru·tuon 22:13, 11 April 2019 (UTC)
Ah, this requires the language type which is currently missing (Module:languages/documentation#Language:getType). I'll add it. – Jberkel 22:26, 11 April 2019 (UTC)
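Once the type field is exported with the language data, the namespace decision could be as simple as the following hypothetical sketch (the function name and field names are illustrative, not the tool's actual API):

```python
def expected_title(lang, pagename):
    """Decide which title to check for a link target, based on the
    language's type (regular / reconstructed / appendix-constructed)."""
    lang_type = lang.get("type", "regular")
    if lang_type == "reconstructed":
        # reconstructed entries live under Reconstruction:<language>/<term>;
        # the display form's leading asterisk is not part of the title
        return "Reconstruction:%s/%s" % (lang["canonicalName"],
                                         pagename.lstrip("*"))
    if lang_type == "appendix-constructed":
        # e.g. Lojban entries live in the Appendix namespace
        return "Appendix:%s" % pagename
    return pagename
```

This would also resolve the Lojban false positives mentioned earlier, since the existence check would then be run against the Appendix title rather than mainspace.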
Oh, the other thing is adding the Reconstruction and Appendix namespaces. I was puzzled at first that the all.jsonl file doesn't include any links from languages ending in -pro besides Proto-Norse (gmq-pro), but that's probably the only such language that has entries in the mainspace. — Eru·tuon 23:08, 11 April 2019 (UTC)
For curiosity's sake, here is a census of the total byte counts of some templates that can contain links to Wiktionary entries. (Redirects are included.) — Eru·tuon 23:18, 11 April 2019 (UTC)
That's a useful list of cases to cover then. Something else I remembered which is missing: nested links are currently skipped, e.g. {{m|en|[[foo]] [[bar]]}}. – Jberkel 23:27, 11 April 2019 (UTC)

The new data is useful for finding mistakes, like Akkadian transliteration put in the wrong parameter, and Greek and Ancient Greek words in the wrong script. Not all of these would be caught by {{redlink category}}, because it only checks that a page exists, not that the language section exists. — Eru·tuon 01:52, 12 April 2019 (UTC)

I made the changes you suggested. It's currently churning through the dump on toolforge. – Jberkel 00:10, 13 April 2019 (UTC)
New lists are published. I haven't added all templates yet, but I think the most common ones are covered now. – Jberkel 08:47, 13 April 2019 (UTC)
Looks like I missed suffix/affix/prefix. Will add them later. – Jberkel 09:03, 13 April 2019 (UTC)
Wonderful! It's interesting how with the addition of the first etymology templates some languages jumped up in the rankings, especially Middle English, which wasn't even in the top 30 or so before. And there are some new words that apparently were only linked in etymology templates, like κεράσιον (kerásion) at the top of the Ancient Greek list. — Eru·tuon 17:22, 13 April 2019 (UTC)
Yes, it's getting more useful, and interesting to see the connections and patterns. Do you know what's up with all those Lojban entries in the "all" list? Are they legit or is some filtering needed? – Jberkel 18:48, 13 April 2019 (UTC)
Hmm, they aren't actual redlinks or "orange links". Lojban is an appendix-reconstructed language, but the logic you're using doesn't account for that, so it looks for an entry in mainspace rather than Appendix namespace. I guess appendix-reconstructed languages should be filtered out, like reconstructed links, until you've developed the arcane logic to handle them. — Eru·tuon 18:55, 13 April 2019 (UTC)
It occurs to me that the link data could also be used (together with data on {{senseid}} templates) to find links that go to a nonexistent sense id. Oh wait, I guess at the moment the values of id parameters are not being saved, so it would require some changes to the data generation. — Eru·tuon 21:18, 13 April 2019 (UTC)
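The check itself would be cheap once both sides are indexed; a sketch with made-up data shapes (the tuples here are illustrative, not the tool's real output format):

```python
# (page, language, id) triples harvested from {{senseid}} instances
sense_ids = {("water", "en", "liquid")}

# links that carry an |id= parameter: (source page, target, language, id)
links_with_id = [
    ("thirst", "water", "en", "liquid"),
    ("ice", "water", "en", "solid"),
]

# a link is broken if no {{senseid}} on the target page matches its id
broken = [link for link in links_with_id
          if (link[1], link[2], link[3]) not in sense_ids]
```

Set membership makes the cross-check linear in the number of id-bearing links, so even a full-dump run would be dominated by the extraction step, not this comparison.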
It would require indexing all instances of {{senseid}}. How many do we have? While useful, I don't think it's a widely used template.
I've just published new lists, almost all of the etymology templates are now parsed (still missing are the column templates). Affixes now dominate the lists. I also found some "creative" template usage, e.g. Special:Diff/48335858/48335878, which nests {{suffix}} inside {{compound}}. – Jberkel 13:16, 14 April 2019 (UTC)
There are also cases like {{bor|en|ja|{{ja-r|御殻|おから}}}} (okara), maybe the safest thing to do is to skip nested templates wholesale. – Jberkel 17:33, 14 April 2019 (UTC)
My {{senseid}} file from the mainspace comes to 132093 bytes, and there are 4195 total instances on 2273 pages. There are a few more in the Reconstruction namespace. (I should have my program look there as well.) I just figure it might be easier to track sense ids with your program than to create a separate one that duplicates some of the same work.
{{ja-r}} would be pretty tricky to parse correctly, so I would omit it (though there are bound to be a lot of redlinks that will be excluded that way). What about other templates like {{l}} and {{m}}? Does your program gather them when they're inside other templates, like when {{l}} is inside the |t= parameter of {{l}} or {{m}}? — Eru·tuon 21:07, 14 April 2019 (UTC)
I wonder why {{bor}} couldn't handle the formatting; it's very messy to nest templates like that. My program doesn't handle the other cases you mentioned, so I'll add that. I can also take a look at the senseids. My general idea for the project is to move it towards something like a framework, covering extraction, parsing, etc., which can easily be extended to perform all sorts of analyses. – Jberkel 21:42, 14 April 2019 (UTC)
({{bor|en|ja|{{ja-r|御殻|おから}}}} is bad and wrong; the parentheses are also tagged as lang=ja. —Suzukaze-c 21:46, 14 April 2019 (UTC))
I think we should move the logic into the linking templates/modules, so one can just write {{bor|en|ja|御殻}}. Easier for editors and easier to parse for machines. – Jberkel 06:12, 15 April 2019 (UTC)
I've started tracking issues on gitlab. Feel free to report issues or make suggestions there. – Jberkel 08:02, 15 April 2019 (UTC)
@Jberkel: Perhaps you might have something to say at Wiktionary:Beer parlour/2019/April#Japanese entry layout revisited. —Suzukaze-c 23:49, 15 April 2019 (UTC)