Kosovo Municipality websites

Hi,

It has been common in the past for new municipal governments to start fresh with their municipality websites, losing institutional memory in the process. It would be a good idea to scrape them just in case.

This tool looks interesting: https://tech.occrp.org/blog/2017/11/21/memorious.html

Anyone want to try it?

Thanks,

Arianit

Arianit,

What is it we want to do? Are we sure “wget [-r] municipal_website” won’t do what we need?
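
Something like this, roughly (I haven't tested these flags against the actual sites; municipal_website is just a placeholder):

    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent --wait=5 --random-wait \
         http://municipal_website/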

Thanks,

Gagi

Gagi,

That could be it, but I don’t know if that’s the best way.

What do our scraping experts - Mike and Ardian - think?

Arianit

I have been using HTTrack substantially since Ardian recommended it
during his presentation at OSCAL, and I have been pleased with it. I don't
think it will work on websites written in ASP.NET or on websites where
links are composed in JavaScript. It's better than "wget -r" because it
rewrites links and is more configurable.
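
For reference, a basic httrack call looks roughly like this (output path, filter, and rate limits are placeholders I haven't tuned against the real sites):

    httrack "http://municipal_website/" -O ./municipality-mirror \
            "+*municipal_website/*" -c2 -%c1 -%v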

I haven't tried the others that he recommended.

Maybe we could list all the municipality websites, write tests to
determine whether a particular mirroring technique worked, and then try or
build different mirroring tools until the tests pass.
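
As a very rough sketch of such a test (hypothetical paths, just to illustrate the idea):

    # after mirroring into ./mirror, check that we actually got pages
    # and that links were rewritten to the local copies
    test "$(find ./mirror -name '*.html' | wc -l)" -gt 10 \
        || echo "FAIL: too few pages mirrored"
    grep -rl "http://municipal_website" ./mirror \
        && echo "WARN: absolute links back to the live site remain"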

They are ASP.NET (Kentico CMS, apparently): https://builtwith.com/?http%3a%2f%2fkk.rks-gov.net%2fdecan%2f

List of municipal websites: https://kk.rks-gov.net/

and a few secondary ones:

http://prishtinaonline.com/

http://gjakovaportal.com/

You could also potentially use PhantomJS to take screenshots of such websites.
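
If I remember correctly, PhantomJS ships with a rasterize.js example that does exactly that; something along these lines (the path to the example script depends on where PhantomJS is installed):

    phantomjs examples/rasterize.js http://prishtinaonline.com/ prishtina.png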

So my thought was to set up a YaCy cluster
(https://yacy.net/en/index.html) for local governments.

What about the Wayback Machine, Mike? Could that work?

Yes, that would work for hosting. Projects I contribute to, like
WikiTeam, do that: https://archive.org/details/wikiteam

See also https://www.quora.com/Aside-from-the-Wayback-Machine-what-are-other-options-for-getting-screenshots-of-websites-from-the-past
and archive.is

This project seems to be the right tool: https://github.com/ikreymer/pywb
Why don't we host the archives and YaCy search on our own servers?
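
From a quick look at pywb, serving a collection of WARCs would be roughly this (collection name and WARC file are just examples):

    pip install pywb
    wb-manager init kosovo-municipalities              # create a collection
    wb-manager add kosovo-municipalities dump.warc.gz  # add a crawl's WARC file
    wayback                                            # serve it on http://localhost:8080/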

Set up a Docker server and load this one: https://hub.docker.com/r/mkaag/yacy/
Then we just need to make a list of seed sites and adjust the
parameters so it doesn't spider all sites; we can work together to review
outgoing links and put the whitelist on GitHub.
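
Something like this should do it (the data-volume path inside the image is a guess; check the image's README):

    docker run -d --name yacy \
        -p 8090:8090 \
        -v /srv/yacy-data:/opt/yacy_search_server/DATA \
        --restart unless-stopped \
        mkaag/yacy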

So, reading this:
http://www.yacy-websearch.net/wiki/index.php/Dev:APICrawler

crawlingMode = sitelist
crawlingURL = some URL we maintain with a list of sites to crawl (could put it on GitHub)
range = domain (only crawl the domains we give it, not outside)
indexText = on
indexMedia = on
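
If I'm reading that API page right, the same call from a script would be roughly (the host, admin password, and GitHub raw URL are placeholders):

    curl --digest -u admin:PASSWORD "http://localhost:8090/Crawler_p.html" \
        -G \
        --data-urlencode "crawlingMode=sitelist" \
        --data-urlencode "crawlingURL=https://raw.githubusercontent.com/ORG/REPO/master/sites.txt" \
        --data-urlencode "range=domain" \
        --data-urlencode "indexText=on" \
        --data-urlencode "indexMedia=on"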

This should give us a backup of those sites. We would just schedule
this crawl once a day or so, and we could back up the latest results or
diffs to archive.org to have a history of the sites.
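
The scheduling and the archive.org side could be as simple as a cron entry plus the internetarchive CLI (the script path and item identifier are placeholders):

    # crontab: run the backup script at 03:00 every Sunday
    0 3 * * 0  /srv/scraper/backup-municipalities.sh

    # inside that script, after the crawl finishes:
    ia upload kosovo-municipality-sites-$(date +%Y%m%d) dump-*.zip \
        --metadata="title:Kosovo municipality websites $(date +%Y-%m-%d)"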

There are some docs on automatic backups:
http://www.yacy-websearch.net/wiki/index.php/En:Debian_High_Availability

mike

Mike, you’re on fire now :slight_smile:

I’m tempted to mirror them all. Agron offered us a server; could we put them all there?

We need to be gentle with these servers (crawl nights and weekends, and probably scan for updates only weekly) so they don’t block us, especially those concentrated at *.rks-gov.net.

I can give you a list to start with by tomorrow.

Make a list of servers, put them in a GitHub file (or anywhere we can
edit together), and then
have Agron set up that Docker image I mentioned; we can go from there.


Resolving the rks-gov.net host names shows:
geoportal.rks-gov.net 82.114.76.12
sessh.rks-gov.net 82.114.76.24
e-prokurimi.rks-gov.net 82.114.76.36
e-edukimi.rks-gov.net 82.114.76.48
mapl.rks-gov.net 82.114.76.55
kk.rks-gov.net 82.114.76.55
pzap.rks-gov.net 82.114.76.57
oshp.rks-gov.net 82.114.76.57
dogana.rks-gov.net 82.114.76.103
mf.rks-gov.net 82.114.76.103
map.rks-gov.net 82.114.76.103
mpms.rks-gov.net 82.114.76.103
krpp.rks-gov.net 82.114.76.103
abgj.rks-gov.net 82.114.76.103
shkk.rks-gov.net 82.114.76.105
mshws.rks-gov.net 82.114.76.109
patentshoferet.rks-gov.net 82.114.76.112
aprk.rks-gov.net 82.114.76.116
mpb.rks-gov.net 82.114.76.116
www.rks-gov.net 82.114.76.120
rks-gov.net 82.114.76.120
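
For what it's worth, a list like that can be regenerated with a small loop (assuming hosts.txt holds one hostname per line):

    while read -r host; do
        printf '%s %s\n' "$host" "$(dig +short "$host" | tail -n1)"
    done < hosts.txt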

Found this: http://2015.index.okfn.org/place/kosovo/

Looks like Cacttus did the geoportal on MS/Oracle:
http://geoportal.rks-gov.net/documents/10179/0/KGP_Public_UserManual_ALB.pdf/0dc1f9ec-794a-4fbb-a8cf-df44eeac4046;jsessionid=626D5160B59938C54C88619BADC5CE1E?version=1.0

This site is interesting; it has the cadastre information for all of Kosovo!

OK, this is a GitHub repo of what Arianit sent me:

So let's start with a script that will do a one-time download, zipping
and uploading to archive.org.

We need to create some sitemap of them as well, which we can put in
the archive.
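
A first cut at that script could look something like this (sites.txt, the archive.org identifiers, and the choice of wget over httrack are all just placeholders to adjust):

    #!/bin/sh
    # one-time download of every site in sites.txt, zipped and pushed to archive.org
    DATE=$(date +%Y%m%d)
    while read -r url; do
        name=$(echo "$url" | sed -e 's|^https://||' -e 's|^http://||' -e 's|[/.]|_|g')
        wget --mirror --convert-links --adjust-extension --page-requisites \
             --no-parent --wait=5 --random-wait \
             --directory-prefix="$name" "$url"
        find "$name" -type f | sort > "${name}-${DATE}-sitemap.txt"   # crude sitemap/file list
        zip -r "${name}-${DATE}.zip" "$name"
        ia upload "kosovo-gov-sites-${name}-${DATE}" \
            "${name}-${DATE}.zip" "${name}-${DATE}-sitemap.txt" \
            --metadata="title:${url} mirror ${DATE}"
    done < sites.txt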

Maybe we should associate WHOIS records, as well as historical WHOIS records, with the domain names. It would help us weed out phishing portals.

Just an idea, though.

//Agron

Yes, agreed. So we want to get a full dump of the DNS records. There
is a protocol for this; hold on.
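
(I'm thinking of a DNS zone transfer, AXFR; most name servers refuse it, but it's worth a try alongside plain WHOIS snapshots. Roughly, with the name server as a placeholder:)

    whois rks-gov.net > whois-rks-gov.net-$(date +%F).txt   # snapshot of registration data
    dig axfr rks-gov.net @NS_SERVER                         # full zone dump, if the NS allows it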

Look at what these guys did:

http://commoncrawl.org/ is the major corpus of data.
It is based on http://nutch.apache.org/, so that is what we want to use.

Here is the tutorial: https://wiki.apache.org/nutch/NutchTutorial
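
Going by that tutorial, the basic flow is roughly this (exact options vary between Nutch versions, and the seed URL is just an example):

    mkdir -p urls
    echo "https://kk.rks-gov.net/" > urls/seed.txt
    # edit conf/nutch-site.xml (http.agent.name) and conf/regex-urlfilter.txt first
    bin/crawl urls/ crawl/ 2      # 2 rounds of generate/fetch/parse/updatedb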