Hi everyone,
Since we will need to find a new hosting solution for nesdev before August, we should think about what to do with the current contents.
The current version of phpBB is 2.0 with some custom upgrades (I think), and maybe it's time to upgrade to 3.0. I remember it being said that moving the content to 3.0 would be difficult, so maybe we should find a way to archive the older version.
Do people think we should put in the effort to convert the current site to some static version? Scripts exist to do this, but since the site was modified in various ways, they may not work "as-is" and may require some custom code.
Another solution would be to keep phpBB 2.0 in a locked state while we start a new forum on 3.0, but I think a static HTML version would be the most lightweight solution.
What do people think? What would be the best solution?
The best way to preserve the old board would be to summarize it and wikify it, but that is a monumental undertaking.
In the meantime, HTML dumps suck. I've seen ugly HTML dumps of wikis: you end up with a ton of HTML files, many of which duplicate other files, and the whole thing wastes tons of disk space due to cluster size padding.
I'd rather see plain text post contents, so you can reconstruct the board from those.
Welp, a good place to start would be
here, taking this information and adding it to the wiki. This is the sprite OAM bug and the Young Indiana Jones quirk, both of which I still don't 100% understand (and apparently neither does anyone else).
Next, if we could finalize what we've discovered about the MMC5 scanline counter thus far and wikificate it, that'd be another good move.
I think the threads that need saving are the ones describing mapper behavior, the CIC threads, the music composition tools thread, and assorted misc. threads. All the newbie threads would be nice too. The "well, how do I make a cart and make it do things" threads don't have to be saved; they just take up more room than they're worth.
There is definitely a lot of content that should be saved, however possible.
I can download the whole BBS myself using a Python script that's specially programmed to download only valid topics (unlike wget, which gets confused). I just need koitsu to sign off on the job and give me an acceptable crawl delay so that my IP doesn't get blocked.
tepples wrote:
I can download the whole BBS myself using a Python script that's specially programmed to download only valid topics (unlike wget, which gets confused). I just need koitsu to sign off on the job and give me an acceptable crawl delay so that my IP doesn't get blocked.
I'm fine with it -- all blocking is done manually as you know. As for the crawl delay, hm, well, I'm more concerned with the rate of network traffic than I am with the fetch intervals or how many concurrent fetches are occurring.
Would it be easier (and more efficient?) to make a private (moderator-only) dump of the MySQL DB for the forum? This is something I could make + put up elsewhere (not on Parodius) for download for folks like Tepples to make use of it. I dunno how much of an undertaking that would require...
Alternately, Tepples, since the server does have Python on it (2.6.7), you could log in to the ndwiki account and run your Python script from there, storing the results in some directory, then tar -pcf dir.tar dir && gzip dir.tar and send that off somewhere. That would keep the HTTP traffic limited to (effectively, not literally) localhost, and the bandwidth/usage would only be associated with the download of dir.tar.gz.
koitsu wrote:
I'm fine with it -- all blocking is done manually as you know.
Then I'll use a distinct user agent so that it'll show up in the server log as friendly. Look for something like "Pino's random browser".
Quote:
As for the crawl delay, hm, well, I'm more concerned with the rate of network traffic
Then the ideal crawl delay is roughly the average size of a phpBB viewtopic page divided by the rate of network traffic that you want my crawler to use. I'll probably play with it over the next few days, starting at a 6 second delay.
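For instance (both numbers below are made-up placeholders, not measurements):
Code:
# Illustrative crawl-delay arithmetic; the page size and bandwidth
# budget here are guesses, not measurements.
avg_page_bytes = 60 * 1024      # suppose a viewtopic page averages 60 KiB
budget_per_sec = 10 * 1024      # and I'm allowed 10 KiB/s of traffic
delay = avg_page_bytes / float(budget_per_sec)  # -> 6.0 seconds between fetches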
Quote:
Would it be easier (and more efficient?) to make a private (moderator-only) dump of the MySQL DB for the forum?
The
markup processing library that I use is for HTML, not for phpBB's flavor of BBCode. But in case someone else wants to import it into phpBB and play with it, you might as well make this dump available to the mod team.
Quote:
Alternately, Tepples, since the server does have Python on it (2.6.7), you could log in to the ndwiki account and run your Python script from there, storing the results in some directory, then tar -pcf dir.tar dir && gzip dir.tar and send that off somewhere. That would keep the HTTP traffic limited to (effectively, not literally) localhost
In other words, it'd keep the HTTP traffic on the LAN. I'll keep the ndwiki shell account in mind once I get the forum and wiki crawlers stable.
Sounds good, Tepples. Thanks for customising the fetches, etc. -- that way if you run into any problems (or I find any) they'll be easier to pinpoint. :-)
Which leaves one problem: we need a new BBS running ASAP so that we can lock this one and I can scrape it without running the risk of the scrape being incomplete. Otherwise, I think I have a working scraper for the phpBB; I'll get to the wwwThreads later.
Well, I could try migrating the existing board to phpBB 3.0 (without messing with the existing board + existing MySQL DB), but I'm not sure about the nesdev theme/skin that's used here.
Or if you guys want to "start fresh" (everyone having to sign up again, etc.) that's probably also fine but might annoy folks a bit.
Alternate method:
1. Find new hosting first (at a new URL/site name)
2. Lock down nesdev.com/bbs/ to be read-only. If phpBB can't do this, the best choice of action I can propose is to put up some temporary access limits that return Forbidden to everyone except Tepples' IP. I can do this without much effort.
3. Let Tepples run his backup script -- which in this case should probably be run full throttle (i.e. no sleep/delays/etc.). That way the board would be down as short as possible.
4. Set up board software + etc. on the new provider -- preferably with migration of usernames/passwords. For example, at the new host, use phpBB 3.0 but tell it (during the migration process) to try and import all the old users/posts. (I believe the last time I tried this it worked okay)
5. Redirect the nesdev.com/bbs/ URL (and other URLs if need be) to the new provider/URL.
If anything needs to be done in real-time, I can show up on EFnet #nesdev or somewhere similar for the duration of the move if that'd make communication easier.
Let me be the first to outright state that any solution that does not keep the current forum as live as it is now is no solution at all. Having had to re-start sites before (even with archives...), I know it's a massive pain: any time someone wants to reference an old thread, they now have to search two or three different sites for it, and it makes a lot of the old tools worthless.
Is there a reason to upgrade to phpBB 3? If this one is secure enough to have not gotten defaced over the years and works well enough, I don't really see a reason to upgrade.
Tepples: how does your script differentiate "useful" posts from non-useful ones? Or is it just going to dump everything?
edit: upon rereading, you said "valid"... you mean non-locked threads, etc.?
Just wanted to say I'm glad that tepples is on the job on this. An archive would be invaluable. When I was just starting out learning the NES I considered making topics for a number of questions, but I always searched the forum first, and 99% of the time found that the question had been asked before and answered in detail.
Heck, whenever I do anything with the NES (infrequently these days...) I still search the forum for quick answers to various questions.
Even though good resources are available on the wiki and other places, this forum is still an amazing resource. I'd love to see it maintained without any culling.
How much space could that possibly take?
Jeroen wrote:
Tepples: how does your script differentiate "useful" posts from non-useful ones? Or is it just going to dump everything?
I'll save all topics.
Quote:
edit: upon rereading, you said "valid"... you mean non-locked threads, etc.?
Valid = not deleted.
Anyway, I think I have the phpBB scraper working. I've tested it on t=100 through 199, which I chose as a stress test because of a huge music rip request thread in that range. I'll start it over and run it on all topic IDs once we make all forums read-only (which is possible in the admin panel).
As for the wiki, we can keep on editing that up until my last scrape because I can get the last modified date for all pages, ten at a time, and then go re-scrape any page that's newer than the version I have. I'm saving both the mark-up and the rendered HTML using the MediaWiki API.
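Pulling those timestamps looks roughly like this (a sketch only: I'm assuming the standard api.php location, and error handling is omitted):
Code:
import json
import urllib
import urllib2

def last_modified(titles, api='http://wiki.nesdev.com/w/api.php'):
    # Ask the MediaWiki API for the latest revision timestamp of up to
    # ten pages per request; multiple titles are joined with '|'.
    params = urllib.urlencode({
        'action': 'query', 'prop': 'revisions', 'rvprop': 'timestamp',
        'titles': '|'.join(titles), 'format': 'json'})
    reply = json.load(urllib2.urlopen(api + '?' + params))
    return dict((page['title'], page['revisions'][0]['timestamp'])
                for page in reply['query']['pages'].values())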
Scraper status as of right now:
- wwwThreads: Not yet started
- phpBB 2: Works, needs to be run while frozen
- Wiki: Works for pages, not yet started for images, not yet supporting timestamps
In any case, there are so many inbound links to nesdev.com, both the front page and the BBS, that I think it'd be wise to permanently arrange for redirection to wherever nesdev.com ends up so that
cool URIs don't change. Or is a video game publisher whose name starts with K behind this?
For the wiki, I can receive a backup of the DB, and we can figure out a way to move it. So for now, I don't see the value of scraping the pages on the wiki. We are not going to lose the data; we can request it.
The same can be done for the BBS if we ask Koitsu. It all depends on the goal we have for the future of the BBS (keep the phpBB 2.0 data or start from scratch).
Since Memblers is the rightful owner of the BBS, I guess we should ask him which direction he wants to go. He seems to have access to some hosting too.
As for the nesdev.com domain, we can negotiate with Koitsu to receive the domain for the nesdev wiki/BBS. This could be a chance to unify the nesdev community under the same name, like:
- wiki.nesdev.com
- bbs.nesdev.com
or whatever we find appropriate.
I think we shouldn't rush things, and should make the proper decision: we still have some time, and Koitsu will help us any way he can (he's a nice guy, you know).
I think it would be a better idea to start from scratch using a more modern BBS system; however, that's just a suggestion.
Some of the threads have relevance, but I think at least 95% of them are no longer relevant. If someone asked a question and got his answer back in 2007, there is no real value in it now.
Only a few threads about new discoveries about the hardware or things like that are actually relevant.
I have no strong opinion on changing BBS software and/or versions. But if it is changed, I request something that supports RSS feeds for channels.
I vote for keeping the entire forums archived. I feel that the history of each discussion is worth preserving. I first joined you guys+gals in October, 2010. I spent a great deal of time reading old posts - getting to know each of you before I ever posted. I remember reading the epic thread on the reverse engineering of the CIC and feeling in awe of how bad-ass you all are.
I had no idea how much nesdev costs to run. I would have pledged $100/year or more had I known.
I am glad that Tepples is working on archiving the data.
Tepples, can you craft a Python program that I can run on my Linux desktop to scrape all of my PMs? Dumping them into a single TXT or HTML file is fine. If not, I'll start on a utility in a week or so, but mine will be in Perl.
Koitsu/Tepples/Memblers: are our BBS and wiki passwords in the MySQL database plain text, hashes, or fully reversible encryption?
I'm fine with a dump (or a portion) of the MySQL database becoming available for public download, so long as the password hashes/whatever are wiped and PMs are omitted.
Guys, keep the ideas flowing.
clueless wrote:
Koitsu/Tepples/Memblers: are our BBS and wiki passwords in the MySQL database plain text, hashes, or fully reversible encryption?
I did the install of the wiki at the beginning, but I don't know how MediaWiki secures its passwords. I guess they must hash them at the least, but I'd have to confirm.
We could still keep the wiki/phpBB "as-is" on the next server. Maybe this is the best solution after all, since nobody would lose any data. Thinking about how to archive the content against any possible future loss of data wouldn't be a bad idea, though.
MediaWiki passwords are stored as md5($salt . '-' . md5($password)). In really old versions of MediaWiki, $salt was fixed to the user ID, but now it can vary.
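For illustration, that scheme in Python (a sketch only; I'm assuming UTF-8 encoded input):
Code:
import hashlib

def mediawiki_password_hash(password, salt):
    # md5($salt . '-' . md5($password)), as described above.
    inner = hashlib.md5(password.encode('utf-8')).hexdigest()
    return hashlib.md5('%s-%s' % (salt, inner)).hexdigest()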
If I were doing it, I'd import the wiki through a database dump, start the BBS fresh with phpBB 3, and keep static HTML versions of BBS 1.0 (wwwThreads) and BBS 2.0 (this phpBB 2 board). Then there might be a project on the new NESdev to go through each of the
almost 9000 topics on BBS 2.0 and summarize relevant information on the wiki.
Tonight I plan to add timestamp checking and image downloading to the wiki scraper so that I can keep running it weekly to pick up page changes.
I'm not sure what benefit if any there is to upgrading. Personally I haven't noticed anything wrong with the forum.
MottZilla wrote:
I'm not sure what benefit if any there is to upgrading. Personally I haven't noticed anything wrong with the forum.
I can see benefits to upgrading (new features and possibly better security), but what benefits are there to starting from scratch with phpBB3 if the automated conversion from the phpBB2 database works fine?
Starting from scratch is stupid. The board should be easy enough to reconstruct on any flat-style board system, given the original unparsed plain text of each post along with its poster, subject, post date, and thread ID. The only thing that might get lost is individual post subjects, which some board systems don't use.
I also don't like the idea of a read-only archived board.
Dwedit, then what is wrong with the way we moved from the old boards?
We did exactly what you call stupid: starting from scratch and having a read-only archived board.
I think the transition was handled well back then.
Also, guys, you should know that everything has a start and an end, including your own life, the messages you write, and so on.
People nowadays seem to assume that since things have become digital, they will last forever.
They won't. Future generations won't want to deal with billions of hard disks of data used by their ancestors, and they will have no room to physically store it all, not to mention that the hard disks themselves will fail, etc...
I think stuff that is relevant should be preserved, and non-relevant things should be cleaned up once in a while. This applies to my room too: what a crapload of stuff I have that is no longer relevant!
Anyway, it's not up to me to decide. I was just pointing out that deleting some data is not the end of the world, and can even be good, as it means less required storage space for the future Nesdev website, so lower costs, etc...
Of course, I'm not saying stuff should be deleted without thinking -- that's NOT what I said.
I think relevant threads and info should be archived, cleaned up, and moved somewhere brand new.
The most important thing to me is that the Nesdev community continues to live somewhere, that people continue to make games, demos, tools, and cool hardware, and make discoveries about the NES, and that the hardware and the NES-related tools (assemblers, tile editors, etc...) are properly available and documented, so that someone who wants to make a NES game can actually do it.
Starting something brand new should be seen as a chance to replace this old board which, while great, is regularly flooded by spammers, and the good but messy wiki, with a stronger and better new structure. Keeping the old stuff and copying it as-is somewhere else would not do the Nesdev community any good.
Didn't even know the old boards existed. My join date is 8 days after the old ones closed.
Mainly for Tepples:
After having a chance to review logs, bandwidth graphs, etc. I'd like to propose that the harvesting script for the existing 2.0 BBS -- assuming that is the route the community/mods/etc. decide to do -- be run on the server itself, followed by the directory being tar + gzip'd up (or zip; doesn't matter to me) then copied off somewhere where it can be extracted.
The reason has to do with bandwidth usage. The existing infrastructure (that would be the 2.0 BBS) does not make use of deflate/gzip compression via HTTP, thus the bandwidth used via HTTP-over-the-Internet will be much higher than if the script was run locally then the results compressed + scp/rsync/ftp'd off somewhere.
Gotta keep in mind that we've got 40-50 hosting people who are all in the process of trying to offload their stuff, so I'm having to juggle a lot of "bandwidth balls" (haha) in the air as a result.
Also, there was concern over the existing links that point to
http://nesdev.com/{whatever} -- no, there is zero guarantee that I will continue to provide DNS records within the parodius.com domain (or subdomains) after October 31st. So if you want links to work, they should be re-written by the Python script. It shouldn't be that hard:
Code:
s#http://nesdev.com/bbs/#http://new.url/forum/#g;
s#http://nesdev.com/#http://new.url/#g;
You get the idea (and order of s/// statements matters as I'm sure you can figure out). That should pretty much do it.
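In Tepples' Python script, the equivalent could be as simple as this (new.url is of course a placeholder):
Code:
def rewrite_links(html):
    # Order matters: the more specific /bbs/ rule must run first.
    html = html.replace('http://nesdev.com/bbs/', 'http://new.url/forum/')
    html = html.replace('http://nesdev.com/', 'http://new.url/')
    return html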
@Koitsu:
Wouldn't it be easier, in a way, if Tepples had a local copy of the BBS DB and could have fun to his heart's content? The database is not that big, actually.
koitsu wrote:
After having a chance to review logs, bandwidth graphs, etc. I'd like to propose that the harvesting script for the existing 2.0 BBS -- assuming that is the route the community/mods/etc. decide to do -- be run on the server itself
To run the wwwThreads and phpBB 2 harvesting scripts on the server and keep the traffic on the LAN, as you mentioned earlier, I'd need
html5lib for Python installed. Is this something a user can do or something root has to do?
Quote:
So if you want links to work, they should be re-written by the Python script.
That's doable. The bigger problem here is inbound links from other web sites.
Anyway, I ran my "mirroring" process on one BBS topic, and it ended up looking like this:
Original topic (261.0 KiB)
Archived version (113.2 KiB)
Xkeeper wrote:
Let me be the first to outright state that any solution that does not keep the current forum as live as it is now is no solution at all. Having had to re-start sites before (even with archives...), I know it's a massive pain: any time someone wants to reference an old thread, they now have to search two or three different sites for it, and it makes a lot of the old tools worthless.
Dwedit wrote:
Starting from scratch is stupid. The board should be easy enough to reconstruct on any flat-style board system, given the original unparsed plain text of each post along with its poster, subject, post date, and thread ID. The only thing that might get lost is individual post subjects, which some board systems don't use.
I also don't like the idea of a read-only archived board.
I agree with an upgrade/import of the existing board rather than starting new. Archives make searching for particular topics that much harder
and require people to re-register. Sometimes this is necessary, but with a SQL dump of the database it's totally possible to preserve everything. Why do anything else?
tepples wrote:
koitsu wrote:
After having a chance to review logs, bandwidth graphs, etc. I'd like to propose that the harvesting script for the existing 2.0 BBS -- assuming that is the route the community/mods/etc. decide to do -- be run on the server itself
To run the wwwThreads and phpBB 2 harvesting scripts on the server and keep the traffic on the LAN, as you mentioned earlier, I'd need
html5lib for Python installed. Is this something a user can do or something root has to do?
It's now installed. :-)
Banshaku wrote:
@Koitsu:
Wouldn't it be easier, in a way, if Tepples had a local copy of the BBS DB and could have fun to his heart's content? The database is not that big, actually.
The issue then becomes formulating MySQL queries that get back exactly what Tepples wants. This might require a lot of reverse-engineering the phpBB 2.x code, which like most open-source forum projects is spaghetti.
If he's willing to write and run all that, I will be more than happy to put up a mysqldump of the DB (I'll provide the URL in the Moderators forum). If he's not sure and needs the DB + schema to work with for starters, that's fine too; I'll be happy to put up a DB dump. Let me know either way if this is what we want to do, as I'm fine with it.
Just remember that the DB dump immediately becomes outdated the instant someone posts on the forum, which is why we're trying to figure out what's best (re: locking down the forum so nobody can post, migrate to something new, or what).
Folks reading this should probably also
read this.
Anthony J. Bentley wrote:
Archives make searching for particular topics that much harder
I'll see what I can do about search, now that I can get the full text of every post.
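One possible approach, sketched here with SQLite's FTS4 module (just one option, and whether FTS is available depends on how the SQLite library was compiled):
Code:
import sqlite3

# Sketch: index post bodies with SQLite's FTS4 full-text module.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE post_fts USING fts4(body)")
conn.executemany("INSERT INTO post_fts (rowid, body) VALUES (?, ?)",
                 [(1, u'MMC5 scanline counter notes'),
                  (2, u'sprite OAM bug discussion')])
for (rowid,) in conn.execute(
        "SELECT rowid FROM post_fts WHERE body MATCH ?", ('scanline',)):
    print rowid  # matches post 1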
koitsu wrote:
It's now installed.
Thank you. I might try harvesting phpBB 2 from the ndwiki shell account in a couple days, after I figure out how to harvest wwwThreads. I've pretty much finished harvesting the wiki already, and I just need to run small updates weekly to bring it up to date with recent changes.
koitsu wrote:
The issue then becomes formulating MySQL queries that get back exactly what Tepples wants. This might require a lot of reverse-engineering the phpBB 2.x code, which like most open-source forum projects is spaghetti.
If I were to reverse engineer data out of a dump, I'd probably put MySQL and phpMyAdmin on my laptop and do it from the tables. That's what I did at my last job: it involved reverse engineering a commercial off-the-shelf order fulfillment package from the tables in order to allow other custom tools to interact with the data through ODBC.
koitsu wrote:
Just remember that the DB dump immediately becomes outdated the instant someone posts on the forum
Yeah, that's a big difference between phpBB 2 and MediaWiki. The MediaWiki API allows pulling the last modified date for each page, so I know exactly what to update.
I can understand wanting an archive of the BBS, but I still don't get why we're crawling the wiki if we're going to use it as-is on the next server. Or did I miss something?
So what is the purpose of crawling the wiki? Is it to remake the static HTML version that can't be generated anymore? I would like to know the goal.
Why not just grab a backup of the database from the phpBB admin control panel?
Banshaku wrote:
So what is the purpose of crawling the wiki? Is it to remake the static HTML version that can't be generated anymore?
That was in fact the original goal of my wiki crawler.
Hamburgler wrote:
Why not just grab a backup of the database from the phpBB admin control panel?
THANK YOU. I didn't know that was there. That'll simplify some things.
I see. If it does work well, it will be useful at a later stage, since the previous MediaWiki plug-in doesn't work anymore.
For now, I don't think you should worry too much about the wiki crawling. My guess is the focus should be on working out the next stage for the BBS with Memblers, and whether an archive is a good thing or not.
Or maybe the focus should be to move first, then find a solution for the archive later.
I tried running my wwwThreads crawler on the Parodius SSH server so that it'll run without any external traffic. I found there's
another Python module that's not installed.
Code:
Traceback (most recent call last):
File "./scrape_bb1.py", line 212, in <module>
main()
File "./scrape_bb1.py", line 198, in main
import sqlite3
File "/usr/local/lib/python2.6/sqlite3/__init__.py", line 24, in <module>
from dbapi2 import *
File "/usr/local/lib/python2.6/sqlite3/dbapi2.py", line 27, in <module>
from _sqlite3 import *
ImportError: No module named _sqlite3
I'm working on fixing the lack of sqlite3. It appears the FreeBSD port for this is horribly, *horribly* broken. I am not a happy camper.
Edit: I see the problem. It's still idiocy/brokenness in the FreeBSD ports system, but I know how to address it.
Sorry about all the lack of modules -- absolutely no user of ours uses Python. The only reason Python is installed at all is because it's a build requirement for Apache. (I threw quite a tantrum on mailing lists about that too.)
Edit: Done. Please note I had to upgrade Python to 2.6.8 to get any of this to work. That means html5lib had to be upgraded, etc. So if you run into new problems in that library, we'll deal with that when we get there. Of course, I'm also not sure why a web crawler would need SQLite.......
koitsu wrote:
I'm working on fixing the lack of sqlite3
[...]
Edit: Done.
Thanks.
Quote:
Of course, I'm also not sure why a web crawler would need SQLite.......
It lets me store exactly how far along I am in the crawl, and it puts all the harvested pages in a single file so that they don't use up so many inodes and so much slack space. Then I can gzip up a single .sqlite file and pull it to my machine once I'm done. I'll try it this evening after I've made a refinement to the schema.
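Roughly like this (a sketch, not my actual schema; the table and column names are made up):
Code:
import sqlite3

conn = sqlite3.connect('phpbb2.sqlite')
conn.execute("""CREATE TABLE IF NOT EXISTS pages
                (topic_id INTEGER, start INTEGER, html TEXT,
                 PRIMARY KEY (topic_id, start))""")

def save_page(topic_id, start, html):
    # INSERT OR REPLACE keeps the crawl resumable: re-fetching a page
    # overwrites the stored copy instead of duplicating it.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                 (topic_id, start, html))
    conn.commit()

def resume_point():
    # The highest topic ID stored so far tells the crawler where to resume.
    (highest,) = conn.execute("SELECT MAX(topic_id) FROM pages").fetchone()
    return highest or 0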
tepples wrote:
It lets me store exactly how far along I am in the crawl, and it puts all the harvested pages in a single file so that they don't use up so many inodes and so much slack space. Then I can gzip up a single .sqlite file and pull it to my machine once I'm done. I'll try it this evening after I've made a refinement to the schema.
SQLite rocks. I use it for several projects and have a fair bit of experience with it. SQLite has a very vibrant user + dev community.
A suggestion: before you compress the sqlite database, "VACUUM" it first. Even on an "insert only" database, vacuum can typically make it smaller (it re-balances the b-trees). On a database that has had updates and deletes, vacuum will almost always make the file smaller.
clueless wrote:
before you compress the sqlite database, "VACUUM" it first.
Thanks for the tip. I added a VACUUM statement at the end of the wwwThreads crawler.
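In Python's sqlite3 that's a one-liner, though any open transaction has to be committed first:
Code:
conn.commit()           # VACUUM refuses to run inside an open transaction
conn.execute("VACUUM")  # rebuild the database file, reclaiming free pages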
As of right now, the wiki is crawled, and wwwThreads is in progress. My next things to do:
- Make downloaded wiki data browsable
- Make downloaded wwwThreads data browsable
- Rewrite phpBB 2 crawler to store information more efficiently based on what I learned making the wwwThreads crawler
- Make a JavaScript full-text search engine for all three crawled sites
Once these scripts stabilize, I'll make them available in case someone else on Parodius wants to archive a site.
Thanks for making a backup of this forum.
It would be nasty restarting from scratch. Basically, all the
golden info is inside here.
Hamburgler wrote:
Why not just grab a backup of the database from the phpBB admin control panel?
Here's why not:
Code:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 32 bytes) in /home/memblers/nesdev/bbs/db/mysql4.php on line 199
I tried to back up the database and I got that error around the time the dump reached phpbb_search_wordlist.
Anyway, I've crawled wwwthreads and the wiki in their entirety.
- wwwthreads.sqlite is 7.8 MiB unzipped or 2.2 MiB gzipped. It was crawled from within Parodius. It contains the HTML of each post in each topic.
- nesdevwiki.sqlite is 4.6 MiB unzipped or about 1 MiB gzipped. It was crawled from home, and I update it incrementally about once a week. It contains both the wiki markup and HTML versions of each page. It's so much smaller than a database dump because it contains only the latest revision, not the entire revision history of each article.
So if I can't export a database dump, I'll have to build a database of publicly available HTML posts. I've run the crawler on my computer (limited to GB, SNES, and Other Retro forums) to give me a limited data set with which to test a reformatting tool, and I'll run it again within Parodius in mid-August to get a full snapshot.
tepples wrote:
Here's why not:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 32 bytes) in /home/memblers/nesdev/bbs/db/mysql4.php on line 199
This is due to a PHP setting called memory_limit, which defaults to 128 MiB (the 134217728 bytes in the error). We can adjust this. However, it indicates that phpBB 2 is horribly inefficient at doing the backup (it should not be holding all of that in memory. Grrr...)
Tepples, I'll work on getting you a full mysqldump of the DB tonight and give you a URL for it via a PM. You can download it to your heart's content + at high speed, as it'll be hosted at a place which has no bandwidth limits whatsoever. I'll name the file with the UNIX timestamp of when the dump was taken, and the dump will use table locking, so it should be intact.
Edit: PM sent.
Snapshot saved. Thanks.
In my full scrape of SNES, GB, and Other Retro forums, I've been able to transform links within posts to links to static HTML versions of the respective topics, as well as pages in both hostnames that the wiki has used (nesdevwiki.ath.cx and wiki.nesdev.com). Of course, I'll update the specific formats once migration plans become more solid.
Can we archive the old NESdev forum as well just for historical development aspects? There may be a lot of information to mine in there as well. We could keep it locked and archived just as we do now.
By "old" do you mean the wwwThreads one? As I wrote above: "I've crawled wwwthreads and the wiki in their entirety."
Bump: Now's your chance to try it out.
This is an example of what the archive of the BBS will look like. The archive of wwwThreads will look similar. Pardon the 404s; only GBDev, SNESdev, and Other Retro Dev are populated. Full-text search isn't done yet either.
I gave it a quick try, and technically the content is fine. What would be interesting:
- Show name of poster in the thread list
- if possible, a template with the exact same layout as bbs would be great
Except for that, it seems fine.
Thank you for the feedback.
Banshaku wrote:
Show name of poster in the thread list
I've taken some cues from
mobile-first, distraction-free design, presenting only what is needed: title, post date, and post count. In the list of topics, how would including the name of the author of the first post in a topic help?
But I guess that for topics with more than a month between the first and last posts, the date of the last post would be useful.
Quote:
if possible, a template with the exact same layout as bbs would be great
I'm trying to make the layout fit into 80 columns (CSS: 40em).
Longer line lengths tend to lead to eye strain and cause problems on smaller screens. Toward that end, I've put a maximum width on <img> elements so that they don't widen pages. I'm also trying to make the WWWThreads and phpBB layouts consistent: my WWWThreads archive uses exactly the same layout as what you see here.
tepples wrote:
In the list of topics, how would including the name of the author of the first post in a topic help?
In most cases it probably wouldn't. If you're looking for a specific topic started by a specific person, it might come in handy (it might be easier to just google it in that case). The only other situation where it might make a difference is that good questions tend to get good answers. If you know who's better at asking good questions, it could help to see who started the topic.
I'm not advocating one position or the other, just trying to answer the question.
For the author in the thread list: if I'm "joy browsing" and come across topics by an interesting author, I may want to read them again. If I'm a new reader and I remember a specific author who had interesting threads, I may read them. So it's quite subjective, but still: why remove information that was already there? Now I have to click a topic to find out who wrote it. Some information has more value for some people than for others; the last person who posted has less value than the author of the thread.
As for the layout, I personally think that's overthinking it a little (the eye strain thing) for what is just a scrape of a website. My reasoning behind the layout is more emotional than practical: I want it to be as close to the original as possible. For new readers it may make no difference, but for us old "geezers", there are memories attached to those threads.
That's only my opinion though!
I agree, it'd be easier if it looked more similar to the current forums, but I understand if limitations prevent that.
Will threads with images look the same? I made
quite a few posts about graphics and I'd like to see them preserved. As long as they still point to the photobucket images, they'll be visible indefinitely.
Images have a
max-width: 100% which means they will be displayed but will be shrunk if they would widen the page. To view full size, right-click and choose View Image. An example is the oscilloscope screenshot in
this post (
original).
For "joy browsing", I have changed the username in the post header into a link to the user's posting history.
Known bug: I still need to add a line of CSS to put
class="code" in a monospace font.
tepples wrote:
For "joy browsing", I have changed the username in the post header into a link to the user's posting history.
That is useful, "but" it doesn't let me check which threads a user started, just what he posted. If I can go page by page and see who made each thread, in some cases I think that's better when I browse a specific forum. This could be a personal preference. Is it something hard to add to the scraper?
UncleSporky wrote:
I agree, it'd be easier if it looked more similar to the current forums, but I understand if limitations prevent that.
Will threads with images look the same? I made
quite a few posts about graphics and I'd like to see them preserved. As long as they still point to the photobucket images, they'll be visible indefinitely.
... Photobucket has a habit of deleting images at random, so if a backup is made, the pictures should ideally be backed up too. :\
Not that it wouldn't be nice to back up the images, but the images aren't hosted by NesDev in the first place. In any other circumstance they would expire after their natural life. And I can understand a "while we're at it" mentality, but images can be bandwidth hogs.
As for images, I could do that in theory. I'll have to add a tool to my phpBB scanner to see how many images there are. Going forward, on the new forum, I think the best option would be for users to make wiki accounts and upload the images to the wiki. I'll have more to say as my own experience with QuestyCaptcha continues. QuestyCaptcha is similar to the Q&A CAPTCHA described
here, and from my first week of using it, it appears far more effective than Google's reCAPTCHA.
As for browsing topics whose
first post is by a given user, I had no idea this was such a desirable use case. I can get the information with one extra JOIN in the get topic or get posting history query; the problem is just how to present the results.
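Against the archive database, that query might look something like this (a sketch; the table and column names here are hypothetical):
Code:
def topics_started_by(conn, username):
    # One extra JOIN: match each topic's first post back to its author.
    return conn.execute("""
        SELECT t.topic_id, t.title
        FROM topics AS t
        JOIN posts AS p ON p.post_id = t.first_post_id
        WHERE p.author = ?
        ORDER BY t.topic_id""", (username,)).fetchall()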
tepples wrote:
As for browsing topics whose first post is by a given user, I had no idea this was such a desirable use case. I can get the information with one extra JOIN in the get topic or get posting history query; the problem is just how to present the results.
I'm not sure we're talking about the same thing, though. Right now, what I'm talking about is that I want the list of topics in an archived forum to contain the name of the author beside the title. Adding an extra search feature to find all topics by a poster is not what I meant, although it's interesting as a concept.
Kit Sniper wrote:
... Photobucket has a habit of deleting images at random, so if a backup is made, the pictures should ideally be backed up too. :\
You must be thinking of ImageShack or TinyPic. I've never heard of Photobucket deleting pics unless they're offensive. I have an actual Photobucket account that I can log into to view pics I've uploaded and albums I've set up. It's not paid or anything, but I believe that as long as I show a certain minimum amount of activity, it won't be disappearing anytime soon. Nothing's been randomly deleted from my account since I opened it 8 years ago.
However it is true that the best possible archive would also contain the pictures so they can never expire. A lot of potentially-useful threads here are already wastelands due to linked pictures and files having expired long ago.
If you upload them without logging in, they will eventually be deleted.
So I guess that's a yes for archiving external images as long as the
robot control file for that origin doesn't Disallow it. Once I implement this, I'll make a report as to what origins showed Disallow. But do I need to archive the results of polls?
EDIT: I have started to develop the image archiving tool. I'll test it first on GB/SNES/Other Retro, and I'll run it from home so as not to burden Parodius with extra Internet bandwidth. Some images will not be archived because the server disallows robots or forbids Referer:
http://nesdev.com/bbs/ .
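The robots.txt check itself is cheap with Python 2's standard library (the helper below is illustrative; caching of per-origin results is omitted):
Code:
import robotparser            # urllib.robotparser in Python 3
from urlparse import urlparse

def may_archive(url, agent="Pino's random browser"):
    # Fetch the origin's robot control file and ask whether this
    # user agent is allowed to fetch the given image URL.
    origin = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url('%s://%s/robots.txt' % (origin.scheme, origin.netloc))
    rp.read()
    return rp.can_fetch(agent, url)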
The image archiver is pretty much working, and so is the private message archiver. Who has Python and wants to try the private message archiver first?
I have python installed on my Gentoo Linux box. I'd like to test the PM archiver. What do I need to do?
Four steps:
- Install html5lib
- Download the archiver and unzip it
- Edit the username and password
- From a terminal window, run python scrape_privmsg.py
Awesome. It works great. I understand the need for "napTime", but it was a little annoying.
On Gentoo Linux, I installed v 0.90 of html5lib (it was the version preferred by portage) ("emerge -avu dev-python/html5lib").
Thank you very much Tepples.
This thread has been up for an entire month, and we still haven't closed down this website.
Anyway, I'm for an archived version of this website. There are just so many websites with incorrect information.
psycopathicteen wrote:
This thread has been up for an entire month, and we still haven't closed down this website.
Anyway, I'm for an archived version of this website. There are just so many websites with incorrect information.
Did you read the announcement that was the catalyst for all this?
See here. We have several months, because Koitsu was kind enough to give LOTS of notice.