Universal Repository
I have a laptop that I bought on Craigslist in 2015 for 150 dollars, and I wanted to keep getting use out of it.
It’s an outdated machine and the GPU isn’t supported by NVIDIA anymore either (annoying to get working these days), but it’s still pretty okay. So I decided to use it for:
So everything is going well and I’m getting good use out of the machine.
Then one day I thought, “Hmm, I wonder if anyone has tried to log in to this thing over SSH,” so I checked the failed login log and was shocked. Within a period of three days I had thousands of failed login attempts.
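If you want to run the same check, here’s a minimal sketch of the idea in Python. It assumes a Debian/Ubuntu-style setup where sshd logs to /var/log/auth.log; on journald-only systems the log lives elsewhere.

```python
# Count OpenSSH failure lines in the auth log (path assumes Debian/Ubuntu).
failed = 0
with open("/var/log/auth.log", errors="replace") as log:
    for line in log:
        # sshd writes lines like "Failed password for invalid user admin from 1.2.3.4 ..."
        if "Failed password" in line:
            failed += 1
print(f"{failed} failed SSH login attempts in this log file")
```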
Immediately I thought, “I need to tighten up security,” so I made a crazy long password and set up Fail2Ban, a tool that watches the SSH logs and bans IPs through the firewall after repeated failed logins.
Then I thought… wait a second. These attacks are constant; maybe I could study them and learn more about them. Things like:
I decided to ban people for an hour, and only after 5 failed login attempts. That way I’d get a better distribution of data, and it would still make it even more difficult for someone to actually break in. I ran the numbers, and a brute-force attack would still take thousands of years to get into my system.
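(In Fail2Ban terms, that’s roughly maxretry = 5 and bantime = 1h.) Here’s a back-of-the-envelope version of that brute-force math in Python; the password length, character set, and attempt rate below are assumptions for illustration, not my real numbers.

```python
# Rough brute-force estimate: search space divided by the observed attempt rate.
# The specific numbers here are illustrative assumptions only.
charset_size = 94          # printable ASCII characters
password_length = 20       # a "crazy long" random password
attempts_per_day = 5000    # on the order of the failed logins I was seeing

search_space = charset_size ** password_length
# On average an attacker finds the password after trying half the space.
days_to_crack = (search_space / 2) / attempts_per_day
print(f"~{days_to_crack / 365:.2e} years to brute force on average")
```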
I began to understand I needed a simple file that would have:
I decided to store the data in a CSV file, just because it’s a simple, lazy solution.
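The exact columns in my file aren’t the point, but here’s a rough sketch of the append step; the file name and column names are illustrative placeholders.

```python
import csv
from pathlib import Path

CSV_PATH = Path("failed_logins.csv")                          # placeholder filename
FIELDS = ["timestamp", "ip", "username", "country", "city"]   # placeholder columns

def append_rows(rows):
    """Append parsed log entries to the CSV, writing a header only if the file is new."""
    write_header = not CSV_PATH.exists()
    with CSV_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```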
The original server logs were not very pretty, so I needed to clean them up and put the cleaned-up data into the CSV file, but in an automated way, since the data was constantly flowing in.
I needed a pipeline!
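The cleanup step basically boils down to a regex over the raw sshd lines. Here’s a simplified sketch; real auth.log lines vary a bit depending on the syslog configuration, so the pattern may need tweaking.

```python
import re

# Simplified pattern for OpenSSH failure lines such as:
#   "Jan  3 04:12:56 host sshd[1234]: Failed password for invalid user admin from 203.0.113.7 port 51234 ssh2"
FAILED_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s[\d:]+).*sshd.*Failed password for (?:invalid user )?"
    r"(?P<username>\S+) from (?P<ip>[\d.]+)"
)

def parse_line(line):
    """Return the fields we care about as a dict, or None for lines we drop."""
    match = FAILED_RE.search(line)
    return match.groupdict() if match else None
```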
First, though, I needed to consider a few issues I’d run into along the way.
Caveats
After examining the logs I noticed a few issues, but I’ll just discuss one here.
What if there were 600 or 200 entries, though? I don’t want to keep cleaning the same data over and over from the log file as new entries are logged. In other words, I shouldn’t re-clean the whole file every time a new entry is logged, because that’s going to get really redundant really quickly and use too many resources.
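One simple way to handle that is to remember how far into the log the last run got and only process the new lines. A sketch, assuming the log file isn’t rotated between runs; the state-file name is made up.

```python
from pathlib import Path

OFFSET_FILE = Path("last_offset.txt")  # made-up state file for this sketch

def read_new_lines(log_path="/var/log/auth.log"):
    """Yield only the lines added since the previous run, then remember where we stopped."""
    offset = int(OFFSET_FILE.read_text()) if OFFSET_FILE.exists() else 0
    with open(log_path, "rb") as log:
        log.seek(offset)                     # skip everything already processed
        for raw in log:
            yield raw.decode(errors="replace")
        OFFSET_FILE.write_text(str(log.tell()))  # save the new byte offset
```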
With the caveats in mind, I needed a pipeline that would (hourly):
- Drop overlapping data / duplicates.
- Drop log entries without an IP address, or with bad data.
- Take the IP addresses and extract the geographic details (a sketch of one way to do this follows this list).
- Add the geographic details as new columns on the rows with the matching IP addresses and login data.
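For the geolocation step, here’s a sketch of one common approach: the geoip2 package querying a locally downloaded MaxMind GeoLite2 City database. Any IP lookup service would work; this is just an illustration, not necessarily the exact lookup my script uses.

```python
import geoip2.database
from geoip2.errors import AddressNotFoundError

# Assumes you've downloaded GeoLite2-City.mmdb (free, but requires a MaxMind account).
reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def geolocate(ip):
    """Return (country, city) for an IP, or blanks if it isn't in the database."""
    try:
        response = reader.city(ip)
        return response.country.name or "", response.city.name or ""
    except AddressNotFoundError:
        return "", ""
```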
I want other people to be able to use the data or contribute, so after making a repository on GitHub I’ll also include all the Git actions in the pipeline:
- Add the files to the staging area (Cronjob)
- Commit the files (Cronjob)
- Push the changes (Cronjob)
I implemented all of this by setting up cron jobs in my crontab that run a Python script and all the Git steps automatically every hour.
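Roughly, the cron entry and the Git portion look something like this; the paths and commit message format are placeholders, and it assumes the repo has a remote configured with credentials that work non-interactively.

```python
# Crontab entry (placeholder paths), running once an hour:
#   0 * * * * /usr/bin/python3 /home/me/ssh-logs/pipeline.py
import subprocess
from datetime import datetime, timezone

def publish(repo_dir="."):
    """Stage, commit, and push the updated data after each hourly run."""
    message = f"Hourly update {datetime.now(timezone.utc).isoformat(timespec='minutes')}"
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # The commit can legitimately fail when nothing changed, so don't hard-fail on it.
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir)
    subprocess.run(["git", "push"], cwd=repo_dir, check=True)
```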
The End LOL