Filesystem Indexing And Searching Walkthrough

This book assumes that you have an existing Solr server with its managed schema already set up and running, with either:

  • an existing indexed dataset, or
  • a dataset loaded from the pre-built .json data files in the sample directory

This part of the book starts with no data at all, and then programmatically creates documents in the Solr index from an external source.

Of course, you could always export your data to a file in one of Solr's supported formats and import it, as the other examples in the book show.

The starter project is here:

https://github.com/synapticloop/panl-filesystem

If you want to skip ahead to the complete implementation, it is held on the completed branch:

https://github.com/synapticloop/panl-filesystem/tree/completed

Indexing A Dataset

There are two ways to provide Solr with data, either through a file (which is the method used in this book so far), or programmatically connecting with the Solr server and providing the data.

This example uses the latter method to provide data to the Solr server, and gives a complete walkthrough of setting up new Solr and Panl servers to provide a search engine interface for files, based on the Panl project.

Note: the indexer can be pointed at any filesystem path; the Panl project was simply chosen as the example.

Project Requirements

The requirements of this filesystem search are as follows:

  • An indexed and searchable system based on a filesystem path
  • A keyword search that looks at the filename and the file contents
  • Search results that can be faceted on:
    • the file path (each individual path element),
    • the size of the file, and
    • the type of the file (based on its extension)

Whilst a simple example, this will provide insight into how to programmatically index data and surface it through the Panl server.

Project Steps

  1. Initialise a new project
  2. Download and configure the Solr Server
  3. Download and configure the Panl Server
  4. Write the indexing code
  5. Run the servers
  6. Search and facet on the results

1. Initialise a new project

A new project named panl-filesystem was created, which you can clone from GitHub: https://github.com/synapticloop/panl-filesystem - note that the main branch is the starter branch, with an empty configuration.

The following directories were created (all relative from the project root):

  • ./servers - to hold the downloaded versions of the Panl and Solr servers (the contents of this directory are ignored by Git)
  • ./config/panl - to hold the configuration items for Panl
  • ./config/solr - to hold the configuration items for Solr

For the indexing of the documents, we require some dependencies:

  1. Apache Tika - this library provides an easy interface for extracting text from a wide range of document formats
  2. SolrJ - to connect to the Solr instance and programmatically index the individual documents
  3. Apache Commons IO - to recursively list the files in a directory

Consequently, the following lines were added to the dependencies section of the build.gradle file:

dependencies {
  implementation 'org.apache.solr:solr-solrj:9.8.0'
  implementation 'org.apache.tika:tika-core:3.0.0'
  implementation 'org.apache.tika:tika-parsers:3.0.0'
  implementation 'org.apache.tika:tika-parsers-standard-package:3.0.0'
  implementation 'commons-io:commons-io:2.16.1'
}

Finally, the application plugin was added to the plugins section of the build.gradle file so that the project can be run as an application:

plugins {
  id 'java'
  id 'application'
}


With the appropriate application configuration:


application {
  mainClass = 'com.synapticloop.Main'
}
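
With the application plugin applied, you can also run the indexer straight through Gradle during development, without building a distribution. A minimal sketch - the directory argument is a placeholder for whatever path you want to index:

Windows

Command(s)

gradlew.bat run --args="c:\path\to\index"


*NIX

Command(s)

./gradlew run --args="/path/to/index"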


2. Download and Configure the Solr Server

To get the Solr server set up and running:

  1. Download the Solr Server
  2. Copy the Solr Configuration
  3. Edit the Managed Schema file
  4. Start the Solr Server
  5. Create the Solr Collection

a. Download the Solr Server

Download the latest version from https://solr.apache.org/downloads.html (this example uses the Solr 9.8.0-slim version) and extract it to the ./servers directory.

Your final directory structure should be ./servers/solr-9.8.0-slim/

Note: all sub-directories under the ./servers directory are ignored by Git (through the .gitignore file).

b. Copy the Solr Configuration

Copy the entire directory of

./servers/solr-9.8.0-slim/server/solr/configsets/_default/conf 

to

./config/solr/

so that the ./config/solr directory and file structure is as follows:

./config/solr/lang/*.txt
./config/solr/managed-schema.xml
./config/solr/protwords.txt
./config/solr/solrconfig.xml
./config/solr/stopwords.txt
./config/solr/synonyms.txt

c. Edit the Managed Schema

The managed-schema.xml file is the one that defines the fields for Solr.  Looking back at the project requirements, there are only a few fields that need to be indexed, and the Solr configuration reflects this.

The first item to edit is the schema name, which should be changed from:

<schema name="default-config" version="1.7">

To:

<schema name="filesystem" version="1.7">

Next, the field definitions part of the managed-schema.xml file was edited to the following (the rest of the file has been omitted):

01   <field name="id" type="string" indexed="true" stored="true" required="true" ↩
          multiValued="false" />
02   <field name="_version_" type="plong" indexed="false" stored="false" ↩
          docValues="true" />
03
04   <field name="filename" type="string" indexed="true" stored="true" ↩
          multiValued="false" />
05   <field name="filetype" type="string" indexed="true" stored="true" ↩
          multiValued="false" />
06   <field name="filesize" type="string" indexed="true" stored="true" ↩
          multiValued="false" />
07   <field name="category" type="string" indexed="true" stored="true" ↩
          multiValued="true" />
08   <field name="contents" type="text_general" indexed="true" stored="false" ↩
          multiValued="false" />
09
10   <field name="_text_" type="text_general" indexed="true" stored="false" ↩
          multiValued="true" />
11
12
13   <uniqueKey>id</uniqueKey>
14
15   <copyField source="filename" dest="_text_"/>
16   <copyField source="contents" dest="_text_"/>


Line 1:

This is the unique key for each indexed document - for this example we are using the full file path. See Line 13 for the Solr XML element that configures this as the unique key for the collection.

Line 2:

The _version_ field is used for cloud deployments.

Line 4:

The name of the file - note that this is also copied to the keyword-searchable _text_ field by the copyField configuration on Line 15.

Line 5:

The type of the file (i.e. the file extension)

Line 6:

The size of the file - derived and stored as a string value rounded up to the nearest 500 units, e.g. 500 KB, 1000 KB, 500 MB, etc.

Line 7:

The category of the file (i.e. each element of the file's path is added to the document as a 'category')

Line 8:

The text contents of the file, which can be searched against. Note that this field is not stored, as the contents of individual files can be too large to be placed into Solr. (There is a way around this by using a streaming parser for indexing; an example is not included in this book.)

Line 10:

This is the holding field into which all keyword-searchable content is copied and analysed.

Note: In recent versions of Solr, the default search field (and hence this Solr field name) is '_text_', NOT 'text'.

Line 13:

This defines the unique document key (see Line 1)

Line 15:

Copies the filename to the text index

Line 16:

Copies the content of the file to the text index
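
As an aside, the reason keyword searches hit the _text_ field at all is that the _default configset points Solr's default search field (df) at _text_ in solrconfig.xml. A sketch of what to look for - your solrconfig.xml may differ slightly:

<initParams path="/update/**,/query,/select,/spell">
  <lst name="defaults">
    <str name="df">_text_</str>
  </lst>
</initParams>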

d. Start the Solr Server

The commands to start the Solr server in cloud mode are explained in detail in earlier chapters; in brief:

Windows

Command(s)

servers\solr-9.8.0-slim\bin\solr.cmd start -e cloud --no-prompt

*NIX

Command(s)

servers/solr-9.8.0-slim/bin/solr start -e cloud --no-prompt


e. Create the Solr Collection

The commands to create the collection in Solr are explained in detail in earlier chapters; in brief:

Windows

Command(s)

servers\solr-9.8.0-slim\bin\solr.cmd create -c filesystem ↩
 -d config\solr --shards 2 -rf 2


*NIX

Command(s)

servers/solr-9.8.0-slim/bin/solr create -c filesystem -d config/solr ↩
 --shards 2 -rf 2
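
To confirm that the collection was created, one option is Solr's healthcheck command (shown here for the *NIX layout; adjust the path and use solr.cmd for Windows), or simply open the Solr Admin UI at http://localhost:8983/solr/ in a browser:

servers/solr-9.8.0-slim/bin/solr healthcheck -c filesystem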

Now that the Solr collection is configured, we can move on to setting up the Panl Server.

3. Download and Configure the Panl Server

a. Download the Panl Server

Download the latest version from https://github.com/synapticloop/panl/releases/ (this example uses the Panl 9-2.0.0 version) and extract it to the ./servers directory.

Your final directory structure should be ./servers/solr-panl-9-2.0.0/

Note: all directories under the ./servers directory are ignored by Git (through the .gitignore file).

b. Configure the panl.properties Files

For ease of implementation, the Panl generator will be used.

Windows

Command(s)

servers\solr-panl-9-2.0.0\bin\panl.bat generate -properties ↩
 config\panl\panl.properties -schema config\solr\managed-schema.xml


*NIX

Command(s)

servers/solr-panl-9-2.0.0/bin/panl generate -properties ↩
 config/panl/panl.properties -schema config/solr/managed-schema.xml

This generates the following two files:

./config/panl/filesystem.panl.properties
./config/panl/panl.properties

c. Edit the Configuration files

The generator does a good job of getting things set up quickly; however, there are a few tweaks that we want to make to the configuration.

For the _text_ field:

  1. panl.facet.t=_text_ should not be a facet - it should be a field, so change it to panl.field.t=_text_
  2. Remove the LPSE code 't' (i.e. the _text_ field) from:
    • the panl.lpse.order property - we don't want to facet on it, and
    • the panl.results.fields.default property - we don't want the text returned with the result documents
  3. Remove the panl.results.fields.firstfive property entirely - we will just use the default FieldSet
  4. Update the panl.sort.fields property to include filename and filetype as sorting options

For the id field:

  5. panl.facet.i=id should not be a facet - it should be a field, so change it to panl.field.i=id
  6. Remove the LPSE code 'i' (i.e. the id field) from the panl.lpse.order property - we don't want to facet on it. Unlike the _text_ field, we do want this value returned with the document, so the panl.results.fields.default property is left untouched.

For the filename field:

  7. panl.facet.f=filename should not be a facet - it should be a field, so change it to panl.field.f=filename
  8. Remove the LPSE code 'f' (i.e. the filename field) from the panl.lpse.order property - we don't want to facet on it. Like the id field, we want this value returned with the document, so the panl.results.fields.default property is left untouched.

These changes are sketched below.
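
A minimal before/after sketch of the facet-to-field changes in the generated .panl.properties file (illustrative only - the generated file will contain many other properties, and the 't', 'i', and 'f' LPSE codes must also be removed from the panl.lpse.order line as described above):

# before (as generated)
panl.facet.t=_text_
panl.facet.i=id
panl.facet.f=filename

# after (edited)
panl.field.t=_text_
panl.field.i=id
panl.field.f=filename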

Apart from that, we are good to go.

d. (Optionally) Start the Panl Server

If you want, you can start up the Panl server; however, as there are no indexed documents yet, it is going to be an empty (and therefore not very useful) page.

Windows

Command(s)

servers\solr-panl-9-2.0.0\bin\panl.bat server -properties ↩
 config\panl\panl.properties


*NIX

Command(s)

servers/solr-panl-9-2.0.0/bin/panl server -properties ↩
 config/panl/panl.properties


If you open your browser to the following URL:

http://localhost:8181/panl-results-viewer/filesystem/default



Image: The In-Built Panl Results Viewer web app showing no results - as we haven't indexed any documents yet.


4. Write the Indexing Code

Now that we have both Solr and Panl set up, it is time to dive into the code. The indexer is fairly simple; it will:

  • take a single command line argument: the base path to start indexing from
  • recursively go through each of the found files and add their information to the Solr index
  • index the following information:
    • the file name
    • the file type (i.e. the extension)
    • the file size (rounded up to the nearest 500 in units of KB, MB, GB, or TB)
    • the categories - based on the file path (i.e. the directory, or folder, structure)
    • the contents - extracted with Apache Tika
  • ignore any files or directories that start with a '.' character, and any files in the 'build' or 'gradle' directories

The code comes in at around 170 lines (including comments) and can be found on the completed branch of the project:

https://github.com/synapticloop/panl-filesystem/blob/completed/src/main/java/com/synapticloop/Main.java

It is deliberately simplistic code, written just to show how to index documents in Solr; there are many extensions and speedups that could improve it.

The logic of the code is:

  1. Check that a directory path has been passed through as a command line argument
  2. Recursively find the files under that directory
  3. For each file, extract the contents with Apache Tika and send the results to the Solr server

The text of the file is included below for reference:

package com.synapticloop;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.*;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Main {
  // these are the directories to ignore
  private static final Set<String> IGNORE_DIRECTORIES = new HashSet<>();

  static {
    IGNORE_DIRECTORIES.add("build");
    IGNORE_DIRECTORIES.add("gradle");
  }

  // This is the file filter that is used for both files and directories
  private static final IOFileFilter FILE_FILTER = new IOFileFilter() {
    @Override public boolean accept(File file) {
      return (accept(file, file.getName()));
    }

    @Override public boolean accept(File file, String fileName) {
      if (IGNORE_DIRECTORIES.contains(fileName)) {
        return (false);
      }
      return (!fileName.startsWith("."));
    }
  };

  public static void main(String[] args) {
    if (args.length == 0) {
      throw new RuntimeException(
          "Expecting one argument of the base directory to index.");
    }

    // try and find the passed in directory
    File baseDir = new File(args[0]);
    if (!baseDir.exists()) {
      throw new RuntimeException(
          "Base directory " +
          args[0] +
          " does not exist.");
    }

    // at this point we are good to index files
    for (File listFile : FileUtils.listFiles(
        baseDir,
        FILE_FILTER,
        FILE_FILTER
    )) {
      indexDocument(baseDir, listFile);
    }
  }

  /**
   * <p>This does the heavy lifting of indexing the documents with Apache Tika,
   * then connecting to the Solr server to add the contents of the document and
   * metadata to the search collection index.</p>
   *
   * @param baseDir The base directory for starting the search indexing
   * @param listFile The file to be indexed
   */
  private static void indexDocument(File baseDir, File listFile) {
    // get the SolrJ client connection to Solr
    CloudSolrClient client = new CloudHttp2SolrClient.Builder(
        List.of("http://localhost:8983/solr/")).build();

    // Extract the information that we will be using for indexing
    String absolutePath = listFile.getAbsolutePath();
    String filePath =
        absolutePath.substring(
            baseDir.getAbsolutePath().length(),
            absolutePath.lastIndexOf(File.separator) + 1);
    String fileName = listFile.getName();
    String id = filePath + fileName;
    String fileType = fileName.substring(fileName.lastIndexOf(".") + 1);

    // the following is done because the Windows file separator is a backslash
    // '\' which interferes with the regex parsing on Windows file systems
    // only.
    String[] categories = filePath.split(
        (File.separator.equals("\\") ? "\\\\" : File.separator)
    );

    // a nicety to put them in the root directory
    if (categories.length == 0) {
      categories = new String[]{"ROOT_DIRECTORY"};
    }

    try {
      // get the contents automatically with the Tika parsing
      String contents = new Tika().parseToString(listFile);

      // create the solr document that is going to be indexed
      SolrInputDocument doc = new SolrInputDocument();

      // Add the fields to the document, the first parameter of the call is the
      // Solr field name - which must match the schema
      doc.addField("id", id);
      doc.addField("filename", fileName);
      doc.addField("filetype", fileType);
      // now for the filesize
      doc.addField("filesize", getFileSize(listFile.length()));
      doc.addField("contents", contents);
      doc.addField("category", categories);

      // now we add the document to the collection "filesystem", which must
      // match the collection that was defined in Solr
      client.add("filesystem", doc);

      // now commit the changes to the filesystem Solr collection
      client.commit("filesystem");

      System.out.println("Indexed file " + listFile.getAbsolutePath());
    } catch (IOException | TikaException | SolrServerException e) {
      // something went wrong - log it and skip this file
      System.out.println(
          "Could not index file " +
              listFile.getAbsolutePath() +
              ", message was: " +
              e.getMessage());
    }

    try {
      // don't forget to close the client
      client.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  private static String getFileSize(long length) {
    if (length <= 0) return "0 Bytes";
    String[] units = new String[]{"Bytes", "KB", "MB", "GB", "TB"};
    int unitIndex = 0;
    double fileSize = (double) length;
    while (fileSize >= 1024 && unitIndex < units.length - 1) {
      fileSize /= 1024;
      unitIndex++;
    }
    return String.format(
        "%d %s",
        roundUpToNearest500(fileSize), units[unitIndex]);
  }

  public static int roundUpToNearest500(double value) {
    return (int) (Math.ceil(value / 500) * 500);
  }
}
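
The code above deliberately trades speed for simplicity: it opens a new client, commits, and closes the client for every single file. As a hedged sketch of one obvious improvement (not part of the project's code), the client can be created once, documents batched, and a single commit issued at the end. The buildDocument helper is hypothetical - it would hold the Tika and field logic from indexDocument, returning the document instead of sending it - and an extra java.util.ArrayList import is assumed:

// A sketch only - batch documents through one client and commit once at the end.
// Assumes the same "filesystem" collection and Solr URL as the code above.
private static void indexAllBatched(File baseDir) throws Exception {
  try (CloudSolrClient client = new CloudHttp2SolrClient.Builder(
      List.of("http://localhost:8983/solr/")).build()) {
    List<SolrInputDocument> batch = new ArrayList<>();
    for (File listFile : FileUtils.listFiles(baseDir, FILE_FILTER, FILE_FILTER)) {
      batch.add(buildDocument(baseDir, listFile)); // hypothetical helper
      if (batch.size() >= 100) {
        client.add("filesystem", batch); // one request for many documents
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      client.add("filesystem", batch);
    }
    client.commit("filesystem"); // a single commit makes everything searchable
  }
}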

5. Run the servers

You may either run the project through your IDE of choice (don't forget to configure the command line argument to be the path that you want to index), or you can build and run it with Gradle.

Windows

Command(s)

gradlew.bat assemble


*NIX

Command(s)

./gradlew assemble


Extract the distributable from the build/distributions directory and execute the application with a command line parameter of the filesystem path to index.

For example, after extracting build/distributions/panl-filesystem-1.0.0.zip into the build/distributions directory, you will be able to run the command:

Windows

Command(s)

build\distributions\panl-filesystem-1.0.0\bin\panl-filesystem ↩
 c:\Users\Synapticloop\Projects\panl\


*NIX

Command(s)

build/distributions/panl-filesystem-1.0.0/bin/panl-filesystem ↩
 /Users/Synapticloop/Projects/panl/

6. Search and facet on the results

If you haven't already started the Panl server, you can start it with the following command:

Windows

Command(s)

servers\solr-panl-9-2.0.0\bin\panl.bat server -properties ↩
 config\panl\panl.properties


*NIX

Command(s)

servers/solr-panl-9-2.0.0/bin/panl server -properties ↩
 config/panl/panl.properties


Navigate to:

http://localhost:8181/panl-results-viewer/filesystem/default

and search and facet on the results.



Image: The In-Built Panl Results Viewer web app showing the filesystem results.

Extending The Project

The project was set up as an easy way to walk through the indexing of a filesystem and its files.  It was deliberately kept simplistic; however, there are plenty of extensions that could be made, including deleting a file from the index (where the file has been deleted from the filesystem, or moved to a new location) and adding more data to the index (e.g. the file creation timestamp, the last modified timestamp, whether the file is text or binary, the size of the file as a range, etc.).  A sketch of the deletion case follows.
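
As a starting point for that first extension, SolrJ can delete a document by its unique key - which, in this project, is the relative file path that was used as the id at indexing time. A minimal sketch, assuming the same client and collection as the indexing code:

// A sketch only: remove a document from the "filesystem" collection when its
// file no longer exists on disk. The id must match the value used at indexing
// time (the path relative to the base directory).
private static void removeDeletedFile(CloudSolrClient client, String id)
    throws SolrServerException, IOException {
  client.deleteById("filesystem", id); // delete by the unique key
  client.commit("filesystem");         // make the deletion visible to searches
}

A fuller implementation would query the ids currently held in the index, compare them against the files found on disk, and delete the difference.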

The starter project (and the completed one) is a good base to start your programmatic indexing journey.

~ ~ ~ * ~ ~ ~