Essential tool designed to help you merge host lists of different types, with structure recognition, record recovery rules and git support (CLI version).
This repo is archived. You can view files and clone it, but cannot push or open issues/pull-requests.
Delta503 8c039e0a0f Working on graalvm native-image and memory optim 6 months ago

hostliststools-cli

  1. About
  2. Features
  3. Installation and first configuration
  4. Manual
  5. Requests and collaboration
  6. Technical information
  7. Benchmarks
  8. Explanation
  9. Disclaimer

1. About

Intro

This is an app written in Java to manage (merge, remove duplicates, try to recover broken records) big sets of host lists such as AdAway, Energized, etc. It is meant to simplify the process of managing host lists from multiple sources.

Motivation

I use Pi-hole a lot, but when I experiment with the server I sometimes have to re-install everything ;). In Pi-hole I have no easy option to back up and restore source lists. I'm also not satisfied with its sequential download speed, or with how it recovers broken domains that have e.g. ";0" or "/0" after the record.

I found no suitable software to manage host lists in this sense, so I wrote it.

With this app I can automate the process and merge my ~88 sources (around 5'700'000 unique records) into one on a faster machine, push the result to my own repo, and then synchronize with the Pi definitely faster.

2. Features

OS

  • Linux
  • BSD (Native git - bash script compatibility - untested)
  • Mac OSX (Native git - bash script compatibility - untested)
  • Windows (Native git - bash script compatibility - untested)
  • Android? (I found it useless because of need to create separate project, UI design and git integration, but it may be a good exercise as I've never written any app for it)

Structure

  • extract and prepare as a maven dependency for future projects
  • extract and prepare as a gradle dependency for future projects (and eventually android)
  • Package and publish to some Linux distros .deb .rpm .aur .apk (alpine Linux) Maybe Open Build Service? (and if easy also to dmg and exe)
  • HOT: Make suitable for shared libraries installation with osjslgit-* (Open Source Java Shared Libraries git installer). IMPORTANT: I have to design and write that app first ;) so check my public repos

Control:

  • Settings by text file
  • Control by arguments (example: java -jar /path/to/base.jar -download -sync)
  • Unwanted Expressions (see Explanation, item 1) saved in a text file
  • Multiple workspaces (one main installation, and multiple configurations across the whole filesystem)
  • Ability to revert to default settings without deleting the .hostliststools folder

Abilities:

  • Downloads the lists in parallel
  • Merges the lists
  • Cleans the lists of unwanted expressions and duplicates
  • Cleans the lists of stray trailing characters, e.g. x.x.x.x;1 or x.x.x.x/1
  • Optionally reads content into a collection straight from the site
  • Copes with "broken" HTML-source lists like pgl yoyo, which are served as a web page structure (not a file)
  • Adds a chosen host-list characteristic expression to the cleaned records
  • Commits and pushes the product file to a repo (problem with Java... working on it. WORKAROUND: bash script)
  • Excludes records listed in local lists
  • Diffs changes between the remote file and the local one during downloading (idea: compare hashes)
  • Includes local lists independently of the -R or -D -M call (needs tree update?)
  • Maintains your own lists? (I didn't think about it at first, but why not)
  • Copes with dnsmasq structure (I made a lambda for it, but because of multiple if statements it slows things down significantly; needs a new argument to control it?)
  • Converts lists from one format to another (like dnsmasq to RBR; see Explanation, item 2)
  • SQL database support (an alternative to storing records in a text file)
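
The cleaning abilities listed above (removing unwanted expressions, recovering records with a trailing ";1" or "/1", dropping duplicates) could be sketched roughly as below. This is only an illustration under assumptions, not the app's actual code: the class name `RecordCleaner`, the method names and the exact recovery regex are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class RecordCleaner {

    // Assumed recovery rule: a trailing ";<digits>" or "/<digits>" is debris.
    private static final Pattern TRAILING_DEBRIS = Pattern.compile("[;/]\\d+$");

    /** Returns the cleaned record, or null when the line carries no record. */
    static String clean(String line, List<String> unwantedExpressions) {
        String record = line.trim();
        if (record.isEmpty() || record.startsWith("#")) {
            return null; // blank line or comment
        }
        // Strip a leading unwanted expression such as "0.0.0.0".
        for (String unwanted : unwantedExpressions) {
            if (record.startsWith(unwanted)) {
                record = record.substring(unwanted.length()).trim();
            }
        }
        // Recover records like "example.com;0" or "example.com/0".
        record = TRAILING_DEBRIS.matcher(record).replaceFirst("");
        return record.isEmpty() ? null : record;
    }

    /** Merges all lines into one sorted, duplicate-free set. */
    static Set<String> mergeAndClean(List<String> lines, List<String> unwantedExpressions) {
        Set<String> merged = new TreeSet<>();
        for (String line : lines) {
            String record = clean(line, unwantedExpressions);
            if (record != null) {
                merged.add(record);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "0.0.0.0 ads.example.com",
                "ads.example.com;0",
                "tracker.example.org/0",
                "# a comment");
        System.out.println(mergeAndClean(lines, List.of("0.0.0.0", "127.0.0.1")));
    }
}
```

A plain prefix check like this is deliberately naive; real lists may need stricter matching so that a hostname which merely starts with an unwanted expression is not truncated.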

Installation and updates:

  • Install script for Linux
  • Update mechanism: cron update script (or Java app? or built in?)

Docs

  • Wiki
  • javadoc for lib that's gonna be extracted from gui and cli app

Cosmetic:

  • logo

3. Installation and first configuration

This is a cool portable program that doesn't perform any administrative actions, which means it can be run from anywhere by anyone without sudo/admin privileges. It also supports multiple workspaces.

  1. Install Dependencies (for now manually)
  2. Configure git credentials (name/pubkey) in your OS
  3. Download latest .jar from release page
  4. Run without any arguments: java -jar /path/to/hostliststools-cli.x.x.x.jar. The main workspace and settings file will be created on the first run
  5. Create source lists as text files saved under any name, one link per line, with links to the external lists you want to merge (in $workdir/.hostliststools/src), records you want to exclude (in $workdir/.hostliststools/whitelist), and records you want to add that are not included in any list (in $workdir/.hostliststools/blacklist)
  6. Adjust parallel download jobs by running java -jar hostliststools-cli.jar -j <yourjobsnumber> (without the <>, of course). The value is remembered, so if you change your mind, run it again (default=4)
  7. Enjoy! (check manual below for more information)

Dependencies:

  • Java Runtime Environment (JRE) >= 11 (I use OpenJDK, no need for Oracle's one)
  • Native Git
  • bash script execution platform (needed for the git support, which I wrote in bash, and for the utilities used there, like grep)

Linux:

  • OpenSUSE: sudo zypper in -y java-17-openjdk git

  • Fedora: sudo dnf install -y java-latest-openjdk-headless git

  • Debian: sudo apt install -y openjdk-17-jre git

*BSD:

Windows:

Mac OSX:

  • brew install git openjdk@16

Example source list:

4. Manual

Guide

usage: usage: java -jar hostlistools-cli.jar [-c] [-D] [-E] [-F <arg>] [-H] [-h]
       [-I] [-i <arg>] [-j <arg>] [-K] [-M] [-o <arg>] [-p] [-R] [-S
       <arg>] [-w <arg>] [-X]
 -c,--clean-environment       removes files from $workspace/downloaded,
                              $workspace/out
 -D,--download                downloads hostlists
 -E,--save-excluded           saves excluded sets of badly formated lines
 -F,--fill-with <arg>         adds prefix and space before records like
                              0.0.0.0 example.com
 -H,--skip-heading            don't include heading
 -h,--help                    print avaliable options
 -I,--print-stats             prints statistic about line processing and
                              adds them to README
 -i,--input <arg>             specify custom input folder sourcelist
 -j,--download-jobs <arg>     paralell download jobs
 -K,--skip-sources            don't include sources
 -M,--merge                   merges and cleans the lists from local files
 -o,--output-filename <arg>   specify merged filename once, default:
                              Merged_$Date_$Time.txt
 -p,--permissive              disable recover rules with ;000
 -R,--read                    reads hostlists directly from web
 -S,--git-set-remote <arg>    set git remote path
 -w,--set-workpace <arg>      executes all tasks on pointed workspace and
                              saves it as last used
 -X,--print-settings          reads variables and prints it for user

Cookbook

If you want to use git:

Once the native git credentials (name/pubkey) are configured, create an initialized repo on your git provider's site and run one of the commands below.

If you use git over ssh (it's cool, secure and painless after the first config; I advise it):

java -jar /path/to/hostliststools-cli.jar -S ssh://git@someGit.com/user/repo.git

If you use git over https with a regular username and password:

java -jar /path/to/hostliststools-cli.jar -S https://someGit.com/user/repo.git

This command saves the variable in the settings file.

Git logic in this app is:

  1. Shallow clone git clone --depth 1 repo
  2. Remove old, copy new updated files
  3. Add . => Commit => Push
  4. Delete git dir, create empty one.

That's because I tried keeping the whole file history and it took more than 1 GB for ~5'700'000 records after 4 commits.

cd /path/to && java -jar hostliststools-cli.jar -c -R -I && bash .hostliststools/scripts/bash-git.sh

  1. Clears $workdir/.hostliststools/downloaded and $workdir/.hostliststools/out
  2. Reads records from the given links without downloading, then saves them to a file in $workdir/.hostliststools/out
  3. Generates and prints statistics about processing
  4. Copies the merged file to the $workdir/.hostliststools/git dir and commits the changes
  5. The bash script pushes it to the remote repo (if you pass -I, the README gets overwritten with the same stats)

Important: FOR NOW the bash script NEEDS to be run from the directory that contains hostliststools-cli.jar, no other (it'll be improved in the future with automatic searching for and entering this dir).

cd /path/to && java -jar hostliststools-cli.jar -c -D -M -I && bash .hostliststools/scripts/bash-git.sh

Same as above, but instead of reading directly from the sites it downloads the files to $workdir/.hostliststools/downloaded (this option exists for further work with diff tools, but it will probably be dropped in favour of checking hashes).
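
The hash check mentioned here is only an idea in this README, so the following is a sketch of how it might look rather than anything the app does today. The class `ListChangeDetector` and its methods are hypothetical; the only real API used is `java.security.MessageDigest`. The assumption is that the digest of the previously processed list is stored somewhere and compared with the digest of the freshly fetched content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ListChangeDetector {

    /** SHA-256 digest of the raw list content. */
    static byte[] sha256(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** True when the fresh content differs from what the stored digest describes. */
    static boolean hasChanged(byte[] previousDigest, byte[] freshContent) {
        return !MessageDigest.isEqual(previousDigest, sha256(freshContent));
    }

    public static void main(String[] args) {
        byte[] oldList = "ads.example.com\n".getBytes(StandardCharsets.UTF_8);
        byte[] newList = "ads.example.com\ntracker.example.org\n".getBytes(StandardCharsets.UTF_8);
        byte[] stored = sha256(oldList);
        System.out.println("unchanged list changed? " + hasChanged(stored, oldList));
        System.out.println("extended list changed?  " + hasChanged(stored, newList));
    }
}
```

Note that this still requires fetching the remote content; it only saves keeping a full local copy around for comparison.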

This app is resource hungry: it should be used with at least 4 GB of RAM and java -Xmx3g -jar (3 GB heap size). For big data sets you'll probably have to experiment with the -Xmx value.

Without git:

java -jar /path/to/hostliststools-cli.jar -R -I

  1. Reads records without downloading from given links then saves to file in $workdir/.hostliststools/out
  2. Generates and prints statistics about processing

Automation

Linux

  • cron is your friend, because (once you configure the path and shell in it properly) you can decide at what hour and on which day of the month the commands are executed, so you can pass it the commands above and forget about the host lists problem ;)

Windows

  • TODO, You tell me

MAC OSX

  • TODO, You tell me

BSD

  • TODO, You tell me

5. Requests and collaboration

Pull requests are accepted; please open an issue explaining your plan first :) If you want to become a maintainer, that's great too, just message me <TODO add communication channel>!

6. Technical information

Dude, why java?

Yeah, I asked myself this question too while playing around with tons of try-catch on every IO operation... I tried Python, Go and bash, but:

  • Some input files have broken encoding (long files with multiple encodings!). Python and bash throw exceptions while reading them. Java and Go cope with it flawlessly and continue without throwing anything!

  • Python and Go don't have built-in advanced streams support. For Python I tried lazy-streams from PyPI, but it's not enough (e.g. only to_list, no set, no distinct, no sort functions). Of the two stream libraries made in Go, one lacks parallelization and the other lacks FlatMap.

  • Python lacks multi-line lambdas (needed to generate processing stats). To boost string processing I did it with streams and lambdas (filtering, mapping etc.).

  • Java's TreeSet does the sorting for me without the need to call any other method, as I'd have to in Python.

  • In Java I have the light htmlunit, compared to Python where I have to use selenium with its painful webdriver download (pssst! It also needs to be updated, so one more stand-alone thing to care about...)

  • Go doesn't seem to have a built-in set, and community implementations are unusable if some function returns only .ToSlice() and you have to loop over the whole dataset just to rewrite it into this data structure!

I really wish I could do it in easy Python, speed-master Go or Linux-integrated bash. I love them, but these are serious problems. And strangely, Java seems to have strong, good, unbreakable IO and parallel lazy streams with collectors like .toSet(), which is essential in this app. If you know a better language with a garbage collector, lazy streams with toSet(), good unbreakable IO, a built-in set, and htmlunit/selenium-like dependencies, open an issue and let's rewrite it :)
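
To make the points about "unbreakable IO", parallel streams and TreeSet concrete, here is a minimal sketch. It is an illustration only, not the app's real code: the class `LenientMerge` and its methods are made up for the example. It relies on the documented behaviour that `InputStreamReader` substitutes U+FFFD for bytes it cannot decode instead of throwing.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class LenientMerge {

    // InputStreamReader replaces undecodable bytes with U+FFFD instead of throwing,
    // so a list with mixed or broken encodings can still be read to the end.
    static List<String> readLines(byte[] rawList) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(rawList), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parallel stream over all lists; the TreeSet both removes duplicates and sorts.
    static Set<String> merge(List<byte[]> rawLists) {
        return rawLists.parallelStream()
                .flatMap(raw -> readLines(raw).stream())
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        byte[] first = "b.example.com\na.example.com\n".getBytes(StandardCharsets.UTF_8);
        // 0xFF is never valid in UTF-8, so this simulates a broken list.
        byte[] second = "a.example.com\n# comment\n\u00FFbroken.example.com\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(merge(List.of(first, second)));
    }
}
```

The replaced character stays in the record, so a later cleaning step would still have to decide whether such a line can be recovered or should be dropped.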

Why native git?

JGit is absolutely incomprehensible, complex and hard to cope with (if you know how, please tell me...). I may be missing something, whatever, but I found it really hard to achieve a simple clone --depth 1, add, commit, and push with both https and ssh.

Now, seriously: it's easier for the user to manage the whole ecosystem, because there's no need to configure credentials a second time in the app. For now I'm satisfied with the effect. I have hardcoded a simple script that uses standard POSIX system utilities (bash, and grep, which reads variables from the settings file).

Why htmlunit dependency?

Sites like yoyo have a structure that is "broken" in terms of use in apps. When I download a file from such a link, I get the web page source, which is <!doctype html> etc. BufferedReader reads line by line and my functions split the lines by tab or space. It's the safest option, even with the regex used to recognize whether a record is a link (comments in the site structure can contain a website address, which would be recognized as a link to blacklist).

So when a link contains .html or .php, I load a "browser without a GUI", which is htmlunit, and save the processed HTML as a string. It could also be solved with an HTML parser after downloading the website first (I did that in my attempt to rewrite the app in Go).
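
The decision described above could be expressed as a small check like the one below. This is a sketch with a made-up class and method name (`SourceKind.needsHeadlessBrowser`), and it only covers the choice of reading strategy, not the htmlunit call itself.

```java
import java.util.Locale;

final class SourceKind {

    /** True when the link looks like a rendered page rather than a plain list file. */
    static boolean needsHeadlessBrowser(String link) {
        String lower = link.toLowerCase(Locale.ROOT);
        return lower.contains(".html") || lower.contains(".php");
    }

    public static void main(String[] args) {
        // Hypothetical links, used only for illustration.
        System.out.println(needsHeadlessBrowser("https://example.com/hosts.txt"));
        System.out.println(needsHeadlessBrowser("https://example.com/serverlist.php?format=hosts"));
    }
}
```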

Why bash script for git instead of e.g. python script?

Well, I decided to make it in bash as it's just ~50 lines that read config from the settings file, rm the folder, git clone, cp the merged file and git add, commit, push. Python would require the user to have yet another interpreter, which is probably not a problem for coders. Also, I'd have to play with secure credentials storage and additional libraries, which increases size... bash has gained decent cross-platform support: it's built into most Linux distros, Mac and BSD are compatible with it, and on Windows you have WSL. It also integrates with the system git config. I cannot run the single commands with Process.exec() because a bug occurred: there's no synchronization between the running commands, so the shallow clone hasn't finished when the copying starts... Maybe it's the fault of some lib that executes them asynchronously?

WHERE is this script? I see no bash in the language summary.

It's hardcoded in the GitWrapper class in the wrapper package, and it gets created on the first run in the .hostliststools/scripts folder. I did it so that this app doesn't depend on remote sources.
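
For anyone who wants to retry the pure-Java route: in the standard library, starting a process returns immediately, and the caller has to wait for it explicitly. Whether that was the cause of the problem described above is not known; the sketch below only shows one way to run external commands strictly one after another with `ProcessBuilder` and `Process.waitFor()`. The class `SequentialCommands` and its method are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class SequentialCommands {

    /**
     * Runs each command to completion before starting the next one.
     * Stops at the first non-zero exit code and returns the codes collected so far.
     */
    static List<Integer> runAll(File workDir, List<List<String>> commands) {
        List<Integer> exitCodes = new ArrayList<>();
        for (List<String> command : commands) {
            try {
                Process process = new ProcessBuilder(command)
                        .directory(workDir)
                        .inheritIO() // avoids blocking on full output pipes
                        .start();
                int exitCode = process.waitFor(); // blocks until the command has finished
                exitCodes.add(exitCode);
                if (exitCode != 0) {
                    break;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for " + command, e);
            }
        }
        return exitCodes;
    }

    public static void main(String[] args) {
        // Uses the running JVM's own launcher so the example does not depend on git.
        String java = System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
        System.out.println(runAll(new File("."), List.of(
                List.of(java, "-version"),
                List.of(java, "-version"))));
    }
}
```

The same method could be handed a shallow clone followed by add, commit and push, each as its own argument list; the copy step in between would then only start after the clone has returned.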

7. Benchmarks

TODO

  • amd64 (< 2.0 GHz <) - For regular desktop to compare
  • aarch64 (<= 2.0 GHz)- For modern microcomputers like raspberry pi >= 3b+
  • armv7 (< 2.0 GHz) - For older microcomputers (or an rpi, if you waste its power and use raspbian instead of official debian / ubuntu / opensuse / fedora; see Explanation, item 3)

8. Explanation

  1. Unwanted Expressions - sometimes lists are saved in the format 0.0.0.0 site.com, like a system hosts file. We cannot merge them in this state, so we have to strip the expressions specific to those host lists.
  2. RBR - short for Record By Record
  3. This is a shitty configuration, as raspbian is still 32-bit while the rpi is 64-bit (you waste the power of your machine and have support for fewer apps)! The funniest part is that raspbian is built on debian, which is available in 64-bit aarch64; similarly you have 64-bit ubuntu, and you also have RPM-based distros like openSUSE. WARNING: you have to read their HCL pages, as some functionality (like pins, audio, camera) may still be WIP or may not be compatible with external hardware. But don't worry, I checked that ubuntu is 100% safe in that regard.

9. Disclaimer

I'm a Linux fan and regular user, so Linux will always be the priority and will have better support and integration (especially when it comes to wrapping native libraries/apps). I have spent hours reading about its structure and news, and solving problems. I'm also an open-source fan.

I don't know BSD, so I can't help with troubleshooting, but I'll be happy to learn about it if something comes up (it's quite similar to Linux, so it'll be easier for me).

I'm not a fan of Windows or Mac OSX, I don't know their structure and I won't be spending time on them, so you'll have to find workarounds like WSL etc. Also, please submit a pull request explaining what's wrong, or open a discussion pointing to external experts, rather than asking me how to fix it ;)

This app is made without any warranty.