Kosovo Municipality websites

Hi,

It has been common in the past for new municipal governments to start fresh with their municipality websites, losing institutional memory in the process. It would be a good idea to scrape them just in case.

This tool looks interesting: https://tech.occrp.org/blog/2017/11/21/memorious.html

Anyone want to try it?

Thanks,

Arianit

Arianit,

What is it we want to do? Are we sure “wget [-r] municipal_website” won’t do what we need?
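
Something like this, roughly (I haven't tested these flags against the actual sites; municipal_website is just a placeholder):

    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent --wait=5 --random-wait \
         http://municipal_website/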

Thanks,

Gagi

Gagi,

That could be it, but I don’t know if that’s the best way.

What do our scraping experts - Mike and Ardian - think?

Arianit

I have been using HTTrack substantially since Ardian recommended it
during his presentation at OSCAL, and I have been pleased with it. I don't
think it will work on websites written in ASP.NET or on websites where
links are composed in JavaScript. It's better than "wget -r" because it
rewrites links and is more configurable.
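
For reference, a basic httrack call looks roughly like this (output path, filter, and rate limits are placeholders I haven't tuned against the real sites):

    httrack "http://municipal_website/" -O ./municipality-mirror \
            "+*municipal_website/*" -c2 -%c1 -%v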

I haven't tried the others that he recommended.

Maybe we could list all the municipality websites, write tests to
determine whether a particular mirroring technique worked, and then try or
build different mirroring tools until the tests pass.
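
As a very rough sketch of such a test (hypothetical paths, just to illustrate the idea):

    # after mirroring into ./mirror, check that we actually got pages
    # and that links were rewritten to the local copies
    test "$(find ./mirror -name '*.html' | wc -l)" -gt 10 \
        || echo "FAIL: too few pages mirrored"
    grep -rl "http://municipal_website" ./mirror \
        && echo "WARN: absolute links back to the live site remain"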

They are ASP.NET (Kentico CMS, apparently): https://builtwith.com/?http%3a%2f%2fkk.rks-gov.net%2fdecan%2f

List of municipal websites: https://kk.rks-gov.net/

and a few secondary ones:

http://prishtinaonline.com/

http://gjakovaportal.com/

You could also potentially use PhantomJS to take screenshots of such websites.
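
If I remember correctly, PhantomJS ships with a rasterize.js example that does exactly that; something along these lines (the path to the example script depends on where PhantomJS is installed):

    phantomjs examples/rasterize.js http://prishtinaonline.com/ prishtina.png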

So my thought was to set up a YaCy cluster
(https://yacy.net/en/index.html) for local governments.

What about the Wayback Machine, Mike? Could that work?

Yes, that would work for hosting. Projects I contribute to, like
WikiTeam, do that: https://archive.org/details/wikiteam

See also https://www.quora.com/Aside-from-the-Wayback-Machine-what-are-other-options-for-getting-screenshots-of-websites-from-the-past
and archive.is

This project seems to be the right tool: https://github.com/ikreymer/pywb
Why don't we host the archives and YaCy search on our own servers?
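
From a quick look at pywb, serving a collection of WARCs would be roughly this (collection name and WARC file are just examples):

    pip install pywb
    wb-manager init kosovo-municipalities              # create a collection
    wb-manager add kosovo-municipalities dump.warc.gz  # add a crawl's WARC file
    wayback                                            # serve it on http://localhost:8080/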

Set up a Docker server and load this one: https://hub.docker.com/r/mkaag/yacy/
Then we just need to make a list of seed sites and adjust the
parameters so it doesn't spider all sites; we can work together to review
outgoing links and put the whitelist on GitHub.
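
Something like this should do it (the data-volume path inside the image is a guess; check the image's README):

    docker run -d --name yacy \
        -p 8090:8090 \
        -v /srv/yacy-data:/opt/yacy_search_server/DATA \
        --restart unless-stopped \
        mkaag/yacy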

So, reading this:
http://www.yacy-websearch.net/wiki/index.php/Dev:APICrawler

crawlingMode = sitelist
crawlingURL = some URL we maintain with a list of sites to crawl (could put it on GitHub)
range = domain (only crawl the domains we give it, not outside)
indexText = on
indexMedia = on
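
If I'm reading that API page right, the same call from a script would be roughly (the host, admin password, and GitHub raw URL are placeholders):

    curl --digest -u admin:PASSWORD "http://localhost:8090/Crawler_p.html" \
        -G \
        --data-urlencode "crawlingMode=sitelist" \
        --data-urlencode "crawlingURL=https://raw.githubusercontent.com/ORG/REPO/master/sites.txt" \
        --data-urlencode "range=domain" \
        --data-urlencode "indexText=on" \
        --data-urlencode "indexMedia=on"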

This should give us a backup of those sites. We would just schedule
this crawl once a day or so, and we could back up the latest results or
diffs to archive.org to have a history of the sites.
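
The scheduling and the archive.org side could be as simple as a cron entry plus the internetarchive CLI (the script path and item identifier are placeholders):

    # crontab: run the backup script at 03:00 every Sunday
    0 3 * * 0  /srv/scraper/backup-municipalities.sh

    # inside that script, after the crawl finishes:
    ia upload kosovo-municipality-sites-$(date +%Y%m%d) dump-*.zip \
        --metadata="title:Kosovo municipality websites $(date +%Y-%m-%d)"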

There are some docs on automatic backups:
http://www.yacy-websearch.net/wiki/index.php/En:Debian_High_Availability

mike

Mike, you’re on fire now :slight_smile:

I’m tempted to mirror them all. Agron offered us a server; could we put them all there?

We need to be gentle with these servers (crawl nights and weekends, and probably scan for updates only weekly) so they don’t block us, especially those concentrated at *.rks-gov.net.

I can give you a list to start with by tomorrow.

Make a list of servers, put them in a GitHub file (or anywhere we can
edit together), and then
have Agron set up that Docker image I mentioned; we can go from there.


Resolving the rks-gov.net host names shows:
geoportal.rks-gov.net 82.114.76.12
sessh.rks-gov.net 82.114.76.24
e-prokurimi.rks-gov.net 82.114.76.36
e-edukimi.rks-gov.net 82.114.76.48
mapl.rks-gov.net 82.114.76.55
kk.rks-gov.net 82.114.76.55
pzap.rks-gov.net 82.114.76.57
oshp.rks-gov.net 82.114.76.57
dogana.rks-gov.net 82.114.76.103
mf.rks-gov.net 82.114.76.103
map.rks-gov.net 82.114.76.103
mpms.rks-gov.net 82.114.76.103
krpp.rks-gov.net 82.114.76.103
abgj.rks-gov.net 82.114.76.103
shkk.rks-gov.net 82.114.76.105
mshws.rks-gov.net 82.114.76.109
patentshoferet.rks-gov.net 82.114.76.112
aprk.rks-gov.net 82.114.76.116
mpb.rks-gov.net 82.114.76.116
www.rks-gov.net 82.114.76.120
rks-gov.net 82.114.76.120
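
For what it's worth, a list like that can be regenerated with a small loop (assuming hosts.txt holds one hostname per line):

    while read -r host; do
        printf '%s %s\n' "$host" "$(dig +short "$host" | tail -n1)"
    done < hosts.txt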

Found this: http://2015.index.okfn.org/place/kosovo/

Looks like Cacttus did the geoportal on MS/Oracle:
http://geoportal.rks-gov.net/documents/10179/0/KGP_Public_UserManual_ALB.pdf/0dc1f9ec-794a-4fbb-a8cf-df44eeac4046;jsessionid=626D5160B59938C54C88619BADC5CE1E?version=1.0

This site is interesting; it has the cadastre information for all of Kosovo!

OK, this is a GitHub repo of what Arianit sent me:

So let's start with a script that will do a one-time download, zipping
and uploading to archive.org.

We need to create some sitemap of them as well, which we can put in
the archive.
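
A first cut at that script could look something like this (sites.txt, the archive.org identifiers, and the choice of wget over httrack are all just placeholders to adjust):

    #!/bin/sh
    # one-time download of every site in sites.txt, zipped and pushed to archive.org
    DATE=$(date +%Y%m%d)
    while read -r url; do
        name=$(echo "$url" | sed -e 's|^https://||' -e 's|^http://||' -e 's|[/.]|_|g')
        wget --mirror --convert-links --adjust-extension --page-requisites \
             --no-parent --wait=5 --random-wait \
             --directory-prefix="$name" "$url"
        find "$name" -type f | sort > "${name}-${DATE}-sitemap.txt"   # crude sitemap/file list
        zip -r "${name}-${DATE}.zip" "$name"
        ia upload "kosovo-gov-sites-${name}-${DATE}" \
            "${name}-${DATE}.zip" "${name}-${DATE}-sitemap.txt" \
            --metadata="title:${url} mirror ${DATE}"
    done < sites.txt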

Maybe we should associate WHOIS records, as well as historical WHOIS records, with the domain names. It would help us weed out phishing portals.

Just an idea, though.

//Agron

Yes, agreed. So we want to get a full dump of the DNS records. There
is a protocol for this; hold on.
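
(I'm thinking of a DNS zone transfer, AXFR; most name servers refuse it, but it's worth a try alongside plain WHOIS snapshots. Roughly, with the name server as a placeholder:)

    whois rks-gov.net > whois-rks-gov.net-$(date +%F).txt   # snapshot of registration data
    dig axfr rks-gov.net @NS_SERVER                         # full zone dump, if the NS allows it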

Look at what these guys did:

http://commoncrawl.org/ is the major corpus of data.
It is based on http://nutch.apache.org/, so that is what we want to use.

Here is the tutorial: https://wiki.apache.org/nutch/NutchTutorial
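
Going by that tutorial, the basic flow is roughly this (exact options vary between Nutch versions, and the seed URL is just an example):

    mkdir -p urls
    echo "https://kk.rks-gov.net/" > urls/seed.txt
    # edit conf/nutch-site.xml (http.agent.name) and conf/regex-urlfilter.txt first
    bin/crawl urls/ crawl/ 2      # 2 rounds of generate/fetch/parse/updatedb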