Web2Text - Deep Structured Boilerplate Removal - Running the Code

In this article I will explain you how you can run the [Web2Text] (https://github.com/dalab/web2text) demonstration code available as explained in their paper or presentation.


Before we can run the entire pipeline, we first have to install some tools for this to work.

Checking the JAVA version

Make sure that you have java > 1.8 installed by checking this with java -version

Installing Scala SBT

  1. Download scala-sbt (https://www.scala-sbt.org/download.html)
    • Note: I had to use 1.3.3 on Windows, 1.3.5 ad 1.3.4 seems broken
      • In this I had to change C:\Program Files (x86)\sbt\bin\sbt.bat line 385 and replace if x%g:^==% == x%g% ( with if "x%g:^==%" == "x%g%" (

Installing Tensorflow

The Web2Text project utilizes Tensorflow to create a convolutional neural network. To utilize the trained model, it will utilize Tensorflow, so we should install it as well. To do this, we can run the following commands:

pip install numpy==1.18.0 tensorflow==1.15.0 tensorflow-gpu==1.15.0

Note: See that we are installing Tensorflow 1.15.0 and not 2.0.0, this is because the original Web2Text code is not up to date with the latest version yet. The error code being returned is that it’s missing the variable_scope. This can potentially be resolved by utilizing ts.compat.v1

Installing NVIDIA CUDA Toolkit

For Tensorflow to perform well, we have to install the NVIDIA CUDA Toolkit, this way we will be able to utilize GPU training / inferencing:

  1. Install https://developer.nvidia.com/cuda-toolkit-archive v10.0
    • Note: >v10.0 doesn’t work! see https://www.tensorflow.org/install/gpu#software_requirements
  2. Install https://developer.nvidia.com/rdp/cudnn-download for v10.0
    • Note: open the .zip file and extract the content in the cuda/ folder to the C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0 folder ```

As a last step, we can now adapt our PATH variable to include the NVIDIA toolkit:

SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\CUPTI\libx64;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include;%PATH%
SET PATH=C:\tools\cuda\bin;%PATH%

Downloading our Web2Text code

Download the source code with: git clone https://github.com/dalab/web2text.git

Running the code

After installing everything, we are now ready to run the code on our own HTML file:

  1. Navigate to root path of Web2Text
  2. Open CMD
  3. [CMD] sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"
    • Note: this will download some files
    • Note: on windows there is a bug, see: https://github.com/sbt/sbt/issues/5222
  4. Files will now be visible in the root folder
  5. [CMD] python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels
    • Note: first error I got was: absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --logtostderr before flags were parsed., this was resolved by opening src/main/python/config.py and adding import sys\nFLAGS(sys.argv) under FLAGS = tf.app.flags.FLAGS
    • Note: second error I got was: tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: trained_model_cleaneval_split : The system cannot find the path specified.
    • This was due to os.path.join(CHECKPOINT_DIR, "unary.ckpt") or os.path.join(CHECKPOINT_DIR, "edge.ckpt") which do not take into account the running from the root directory. We can easily resolve this by adding os.path.dirname(__file__) to the join parameters. Example: os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "unary.ckpt")
  6. [CMD] sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"

Running the code - Summarized

For people just wanting to run the code you can find a copy/paste example here :)

sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"
python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels
sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"

Xavier Geerinck

Xavier works as a Cloud Solution Architect at Microsoft, helping its customer unlock the full potential of the cloud. Even though he is still considered a young graduate, he achieved his first success at the age 16, by creating and selling his first startup. He then took this knowledge to create and help more startups in different markets such as technology, social media, philanthropy and home care. While in the meantime gaining more enterprise insights at renowned enterprises such as Nokia, Cisco and now Microsoft.

Read More