January 2, 2020 - ai ai-ml
Web2Text - Deep Structured Boilerplate Removal - Running the Code

Xavier Geerinck
In this article I will explain how you can run the [Web2Text](https://github.com/dalab/web2text) demonstration code, as described in the accompanying paper and presentation.
Prerequisites
Before we can run the entire pipeline, we first have to install some tools for this to work.
Checking the Java version
Make sure that you have Java > 1.8 installed by checking this with `java -version`.
Installing Scala SBT
- Download scala-sbt (https://www.scala-sbt.org/download.html)
  - Note: I had to use 1.3.3 on Windows, as 1.3.5 and 1.3.4 seem broken
  - I also had to edit `C:\Program Files (x86)\sbt\bin\sbt.bat` at line 385 and replace `if x%g:^==% == x%g% (` with `if "x%g:^==%" == "x%g%" (`
Installing Tensorflow
The Web2Text project uses Tensorflow for its convolutional neural network, so to run inference with the trained model we need to install Tensorflow as well. To do this, we can run the following command:
pip install numpy==1.18.0 tensorflow==1.15.0 tensorflow-gpu==1.15.0
Note: we are installing Tensorflow 1.15.0 and not 2.0.0, because the original Web2Text code is not up to date with the latest version yet. On Tensorflow 2.x it fails with an error about the missing `variable_scope`. This can potentially be resolved by utilizing `tf.compat.v1`.
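For completeness, here is a minimal sketch of what that compat approach could look like on a Tensorflow 2.x install. I have not verified it against the Web2Text code, which is why we stick with 1.15.0 in this article:

```python
# Sketch: running TF1-style code (e.g. tf.variable_scope) on a Tensorflow 2.x install.
# Untested with Web2Text; shown only to illustrate the compat.v1 approach.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # fall back to TF1 graph-mode semantics

with tf.variable_scope("example"):
    w = tf.get_variable("w", shape=[3, 3])
    print(w.name)  # example/w:0
```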
Installing NVIDIA CUDA Toolkit
For Tensorflow to perform well, we have to install the NVIDIA CUDA Toolkit so that we can use the GPU for training and inference:
- Install the CUDA Toolkit v10.0 from https://developer.nvidia.com/cuda-toolkit-archive
  - Note: >v10.0 doesn't work! See https://www.tensorflow.org/install/gpu#software_requirements
- Install cuDNN for v10.0 from https://developer.nvidia.com/rdp/cudnn-download
  - Note: open the .zip file and extract the contents of the `cuda/` folder into the `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0` folder
As a last step, we can now adapt our PATH variable to include the NVIDIA toolkit:

```bash
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\CUPTI\libx64;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include;%PATH%
SET PATH=C:\tools\cuda\bin;%PATH%
```
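To verify that Tensorflow actually picks up the GPU after this, a quick check from Python can help (assuming the Tensorflow 1.15 install from above):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Should print True when CUDA, cuDNN and the driver are set up correctly
print(tf.test.is_gpu_available())

# Lists the devices Tensorflow can use, e.g. /device:GPU:0
print([d.name for d in device_lib.list_local_devices()])
```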
Downloading our Web2Text code
Download the source code with `git clone https://github.com/dalab/web2text.git`
Running the code
After installing everything, we are now ready to run the code on our own HTML file:
- Navigate to the root path of Web2Text
- Open CMD
- [CMD] `sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"`
  - Note: this will download some files
  - Note: on Windows there is a bug, see: https://github.com/sbt/sbt/issues/5222
  - The extracted feature files will now be visible in the root folder
- [CMD] `python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels`
  - Note: the first error I got was `absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --logtostderr before flags were parsed.`; this was resolved by opening `src/main/python/config.py` and adding `import sys` together with `FLAGS(sys.argv)` under `FLAGS = tf.app.flags.FLAGS` (a sketch of this change follows after this list)
  - Note: the second error I got was `tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: trained_model_cleaneval_split : The system cannot find the path specified.`
    - This was due to `os.path.join(CHECKPOINT_DIR, "unary.ckpt")` and `os.path.join(CHECKPOINT_DIR, "edge.ckpt")`, which do not take into account that we are running from the root directory. We can easily resolve this by adding `os.path.dirname(__file__)` to the join parameters, for example `os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "unary.ckpt")` (a sketch follows after this list as well)
- [CMD] `sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"`
Running the code - Summarized
For people who just want to run the code, here is a copy/paste example :)
sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labelssbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"