Web2Text - Deep Structured Boilerplate Removal - Running the Code

Xavier Geerinck

January 02, 2020 / ai ai-ml

In this article I will explain how to run the demonstration code for Web2Text (https://github.com/dalab/web2text), the system described in the authors' paper and presentation.

Prerequisites

Before we can run the entire pipeline, we first have to install a few tools.

Checking the Java version

Make sure that you have Java 1.8 or later installed. You can check this with java -version
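If you prefer to script this check, here is a minimal sketch in Python. The helper names java_feature_version and installed_java_version are hypothetical, and the regular expression assumes the usual version "..." banner that java -version prints, which may differ between vendors.

```python
import re
import subprocess


def java_feature_version(banner):
    """Extract the Java feature version (8, 11, ...) from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    # Old-style version strings look like "1.8.0_231"; there 8 is the feature version.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_version():
    """Run `java -version` and parse its banner; returns None if Java cannot be started."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        return None
    # The banner is normally written to stderr, so both streams are inspected.
    return java_feature_version(result.stderr + result.stdout)
```

Calling installed_java_version() returns 8 for a 1.8 installation, so a result of 8 or higher (and not None) means the prerequisite is met.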

Installing Scala SBT

  1. Download scala-sbt (https://www.scala-sbt.org/download.html)

    • Note: I had to use 1.3.3 on Windows; 1.3.4 and 1.3.5 seem to be broken

      • In this version I also had to edit line 385 of C:\Program Files (x86)\sbt\bin\sbt.bat and replace if x%g:^==% == x%g% ( with if "x%g:^==%" == "x%g%" (

Installing TensorFlow

The Web2Text project uses TensorFlow to build a convolutional neural network, and running the trained model also requires TensorFlow, so we have to install it as well. To do this, we can run the following command:

pip install numpy==1.18.0 tensorflow==1.15.0 tensorflow-gpu==1.15.0

Note: we are installing TensorFlow 1.15.0 and not 2.0.0, because the original Web2Text code has not yet been updated for the latest version. Under 2.0.0 the code fails with an error saying that variable_scope is missing. This can potentially be resolved by using tf.compat.v1
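To illustrate the tf.compat.v1 idea, here is a small sketch that I have not verified against the Web2Text sources. It assumes that the only problem is TF1-style symbols such as variable_scope having moved out of the top-level namespace; the helper name tf1_api is hypothetical.

```python
def tf1_api(tf_module):
    """Return the namespace that exposes the TF1-style API for a tensorflow module.

    If the module ships tf.compat.v1 (as 1.15 and 2.x do), that namespace is
    returned. Otherwise the module itself is assumed to be a TF1 build and is
    returned unchanged.
    """
    compat = getattr(tf_module, "compat", None)
    v1 = getattr(compat, "v1", None)
    return v1 if v1 is not None else tf_module


# Usage inside a Python file that currently does `import tensorflow as tf`
# (requires TensorFlow to be installed):
#   import tensorflow
#   tf = tf1_api(tensorflow)
#   tf.variable_scope  # resolves on 1.15 as well as 2.x
```

Code written for graph mode will probably also need a call to disable_v2_behavior() from the v1 namespace when running on 2.x, so sticking to 1.15.0 remains the safer route.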

Installing NVIDIA CUDA Toolkit

For TensorFlow to perform well, we have to install the NVIDIA CUDA Toolkit so that we can use the GPU for training and inference:

  1. Install CUDA Toolkit v10.0 from https://developer.nvidia.com/cuda-toolkit-archive

  2. Install cuDNN for CUDA v10.0 from https://developer.nvidia.com/rdp/cudnn-download

    • Note: open the .zip file and extract the contents of its cuda/ folder into the C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0 folder

As a last step, we add the NVIDIA toolkit directories to our PATH variable:

SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\CUPTI\libx64;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include;%PATH%
SET PATH=C:\tools\cuda\bin;%PATH%
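Because SET only changes the current CMD session, it can be handy to verify that the directories really are on PATH before starting Python. The following is a minimal sketch; the helper name missing_path_entries is hypothetical and the directory list simply mirrors the SET commands above.

```python
import os


def missing_path_entries(required_dirs, path_value=None):
    """Return the entries of required_dirs that do not appear on PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(path):
        # Normalise case and trailing separators so comparisons work on Windows.
        return os.path.normcase(os.path.normpath(path))

    present = {norm(entry) for entry in path_value.split(os.pathsep) if entry}
    return [d for d in required_dirs if norm(d) not in present]


# Directories taken from the SET PATH commands; adjust if CUDA lives elsewhere.
CUDA_ROOT = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0"
REQUIRED_DIRS = [
    CUDA_ROOT + r"\bin",
    CUDA_ROOT + r"\extras\CUPTI\libx64",
    CUDA_ROOT + r"\include",
]
```

Running missing_path_entries(REQUIRED_DIRS) in the same CMD window should return an empty list; any directory it returns still has to be added.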

Downloading the Web2Text code

Download the source code with: git clone https://github.com/dalab/web2text.git

Running the code

After installing everything, we are now ready to run the code on our own HTML file:

  1. Navigate to the root path of Web2Text
  2. Open CMD
  3. [CMD] sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"

  4. Files will now be visible in the root folder
  5. [CMD] python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels
  6. Note: the first error I got was: absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --logtostderr before flags were parsed. I resolved this by opening src/main/python/config.py and adding two lines, import sys and FLAGS(sys.argv), directly under FLAGS = tf.app.flags.FLAGS
  7. Note: the second error I got was: tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: trained_model_cleaneval_split : The system cannot find the path specified.

    • This was caused by os.path.join(CHECKPOINT_DIR, "unary.ckpt") and os.path.join(CHECKPOINT_DIR, "edge.ckpt"), which are resolved relative to the current working directory and therefore fail when the script is run from the root directory. We can easily resolve this by adding os.path.dirname(__file__) to the join parameters. Example: os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "unary.ckpt")
  8. [CMD] sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"
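The checkpoint path fix from the second note can be sketched as a small helper. The name checkpoint_path is hypothetical, and the value of CHECKPOINT_DIR is taken from the error message rather than from the Web2Text sources, so treat both as assumptions.

```python
import os

# Assumed value, taken from the directory named in the NotFoundError message.
CHECKPOINT_DIR = "trained_model_cleaneval_split"


def checkpoint_path(filename, base_dir=None):
    """Build a checkpoint path relative to this file instead of the working directory."""
    if base_dir is None:
        # Directory that contains this Python file, wherever the script is started from.
        base_dir = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base_dir, CHECKPOINT_DIR, filename)
```

With this in place, checkpoint_path("unary.ckpt") and checkpoint_path("edge.ckpt") point at the same files whether the script is started from the root folder or from src\main\python.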

Running the code - Summarized

For people who just want to run the code, here is a copy/paste example :)

sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"
python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels
sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"
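The same three commands can also be chained from Python so that the run stops at the first failing step. This is only a sketch: the function names build_pipeline_commands and run_pipeline are hypothetical, it assumes it is started from the Web2Text root folder, and it assumes the Python interpreter running it is the one with TensorFlow installed.

```python
import shutil
import subprocess
import sys

SBT_PACKAGE = "ch.ethz.dalab.web2text"


def build_pipeline_commands(input_html, work_dir="result"):
    """Return the three pipeline commands as argument lists."""
    features = work_dir + "/step_1_extracted_features"
    labels = work_dir + "/step_2_classified_labels"
    output = work_dir + "/step_3_applied_labels"
    return [
        # sbt receives the whole "runMain ..." string as a single argument.
        ["sbt", "runMain {}.ExtractPageFeatures {} {}".format(
            SBT_PACKAGE, input_html, features)],
        [sys.executable, "src/main/python/main.py", "classify", features, labels],
        ["sbt", "runMain {}.ApplyLabelsToPage {} {} {}".format(
            SBT_PACKAGE, input_html, labels, output)],
    ]


def run_pipeline(input_html, work_dir="result"):
    """Run the steps in order; check=True raises as soon as one of them fails."""
    for command in build_pipeline_commands(input_html, work_dir):
        # shutil.which resolves "sbt" to sbt.bat on Windows via PATHEXT.
        executable = shutil.which(command[0]) or command[0]
        subprocess.run([executable] + command[1:], check=True)
```

Calling run_pipeline("result/input.html") then corresponds to the three lines above.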

Xavier Geerinck © 2020
