# Web2Text - Deep Structured Boilerplate Removal - Running the Code

In this article I will explain how you can run the Web2Text demonstration code that accompanies the paper and presentation.

## Prerequisites

Before we can run the entire pipeline, we first have to install a few tools.

### Checking the Java version

Make sure that you have at least Java 1.8 installed by checking with `java -version`.
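
If you want to check this programmatically, a minimal Python sketch (assuming `java` is on your PATH) could look like this:

```python
# Quick sanity check for the Java prerequisite (a sketch, assuming `java` is on PATH).
# Note that `java -version` writes its output to stderr, not stdout.
import subprocess

result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.strip())  # prints something like: java version "1.8.0_..."
```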

### Installing Scala SBT

1. Download scala-sbt (https://www.scala-sbt.org/download.html)
    * Note: I had to use sbt 1.3.3 on Windows; 1.3.5 and 1.3.4 seem broken
    * Note: I also had to edit `C:\Program Files (x86)\sbt\bin\sbt.bat` at line 385 and replace `if x%g:^==% == x%g% (` with `if "x%g:^==%" == "x%g%" (`

### Installing TensorFlow

The Web2Text project uses TensorFlow for its convolutional neural network, so to run the trained model we need TensorFlow installed as well. To do this, we can run the following command:

```bash
pip install numpy==1.18.0 tensorflow==1.15.0 tensorflow-gpu==1.15.0
```
Note: we are installing TensorFlow 1.15.0 and not 2.0.0 because the original Web2Text code is not up to date with the latest version yet. Under TensorFlow 2.x it fails with an error about the missing `variable_scope`; this could potentially be resolved by using `tf.compat.v1`.
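
To verify that the pinned 1.x version got installed, a quick sanity check such as the following sketch (not part of Web2Text) can help:

```python
# Quick check that the pinned TensorFlow 1.x install works (a sketch, not part of Web2Text).
import tensorflow as tf

print(tf.__version__)                 # expect 1.15.0
print(hasattr(tf, "variable_scope"))  # True on 1.x; False on 2.x unless you go through tf.compat.v1
```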

### Installing the NVIDIA CUDA Toolkit

For TensorFlow to perform well, we have to install the NVIDIA CUDA Toolkit, so that we can use the GPU for training and inference:

1. Install the CUDA Toolkit v10.0 from https://developer.nvidia.com/cuda-toolkit-archive
    * Note: versions newer than v10.0 do not work with TensorFlow 1.15; see https://www.tensorflow.org/install/gpu#software_requirements
2. Install cuDNN for CUDA v10.0 from https://developer.nvidia.com/rdp/cudnn-download
    * Note: open the .zip file and extract the contents of its `cuda/` folder into the `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0` folder

As a last step, we can now adapt our PATH variable to include the NVIDIA toolkit:

```bash
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\CUPTI\libx64;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include;%PATH%
SET PATH=C:\tools\cuda\bin;%PATH%
```
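
To confirm that TensorFlow can actually see the GPU after updating the PATH, a quick check (a sketch using the TensorFlow 1.x API) is:

```python
# Sanity check that TensorFlow 1.x picks up the CUDA/cuDNN install (a sketch).
import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists the devices TensorFlow can see; a working CUDA setup shows a /device:GPU:0 entry.
print([d.name for d in device_lib.list_local_devices()])

# TF 1.x helper that returns True when a CUDA-enabled GPU is usable.
print(tf.test.is_gpu_available())
```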

### Downloading our Web2Text code

Download the source code with: `git clone https://github.com/dalab/web2text.git`

## Running the code

After installing everything, we are now ready to run the code on our own HTML file:

1. Navigate to the root path of the Web2Text project
2. Open CMD
3. [CMD] `sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step_1_extracted_features"`
    * Note: this will download some files
    * Note: on Windows there is a bug, see https://github.com/sbt/sbt/issues/5222
4. Files will now be visible in the root folder
5. [CMD] `python src\main\python\main.py classify result\step_1_extracted_features result/step_2_classified_labels`
    * Note: the first error I got was `absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --logtostderr before flags were parsed.`; this was resolved by opening `src/main/python/config.py` and adding `FLAGS(sys.argv)` (plus `import sys` at the top of the file) right under `FLAGS = tf.app.flags.FLAGS`
    * Note: the second error I got was `tensorflow.python.framework.errors_impl.NotFoundError: FindFirstFile failed for: trained_model_cleaneval_split : The system cannot find the path specified.`
        * This was caused by `os.path.join(CHECKPOINT_DIR, "unary.ckpt")` and `os.path.join(CHECKPOINT_DIR, "edge.ckpt")`, which do not take into account that we are running from the root directory. We can easily resolve this by adding `os.path.dirname(__file__)` to the join parameters, for example `os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "unary.ckpt")`. Both fixes are sketched after this list.
6. [CMD] `sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step_2_classified_labels result/step_3_applied_labels"`
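
For reference, the two local patches from step 5 look roughly like this (an illustrative sketch of the edited lines, not a drop-in file; the variable names and the exact value of `CHECKPOINT_DIR` are assumptions based on the error message):

```python
# Sketch of the two local fixes from step 5 (illustrative, not a drop-in file).
import os
import sys
import tensorflow as tf

# --- Fix 1: src/main/python/config.py ---
# Parse the command-line flags immediately so that flags such as --logtostderr
# can be accessed at import time without raising UnparsedFlagAccessError.
FLAGS = tf.app.flags.FLAGS
FLAGS(sys.argv)

# --- Fix 2: checkpoint paths ---
# Resolve the checkpoint directory relative to this file instead of the current
# working directory, so running from the project root still finds the model.
CHECKPOINT_DIR = "trained_model_cleaneval_split"  # assumed value, taken from the error message
unary_checkpoint_path = os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "unary.ckpt")
edge_checkpoint_path = os.path.join(os.path.dirname(__file__), CHECKPOINT_DIR, "edge.ckpt")
```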

## Running the code - Summarized

For people who just want to run the code, here is a copy/paste example :)

bash sbt "runMain ch.ethz.dalab.web2text.ExtractPageFeatures result/input.html result/step1extractedfeatures" python src\main\python\main.py classify result\step1extractedfeatures result/step2classifiedlabels sbt "runMain ch.ethz.dalab.web2text.ApplyLabelsToPage result/input.html result/step2classifiedlabels result/step3applied_labels" ```