Filesystem Indexing And Searching Walkthrough
Up to this point, this book has assumed that you have an existing Solr server with its managed schema already set up and running, with either:
- an existing indexed dataset, or
- the pre-built .json data files from the sample directory.
This part of the book starts with no data at all, and then programmatically creates the data in the Solr index from a source. Of course, you could always create an export file of your data in one of Solr's supported formats and import it, as the other examples in the book show.
The starter project is here:
https://github.com/synapticloop/panl-filesystem
If you want to skip ahead to the complete implementation, it is held on the completed branch:
https://github.com/synapticloop/panl-filesystem/tree/completed
Indexing A Dataset
There are two ways to provide Solr with data: through a file (which is the method used in this book so far), or by programmatically connecting to the Solr server and providing the data.
This example uses the latter method, and provides a complete walkthrough of setting up new Panl and Solr servers to provide a search engine interface for files, based on the Panl project.
Note: The indexer could be used for any filesystem path; the Panl project was chosen as the example path to index.
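To make 'programmatically providing the data' concrete before the full walkthrough, here is a minimal SolrJ sketch of adding a single document (a hedged preview, not the book's code: it assumes a local Solr instance with a collection named 'filesystem', which is exactly what the following steps create):

import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.List;

public class QuickAdd {
    public static void main(String[] args) throws Exception {
        // connect to the local Solr instance
        try (CloudSolrClient client = new CloudHttp2SolrClient.Builder(
                List.of("http://localhost:8983/solr/")).build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "/example.txt");      // the unique key
            doc.addField("filename", "example.txt");
            client.add("filesystem", doc);           // send to the collection
            client.commit("filesystem");             // make it searchable
        }
    }
}

The full indexing code in step 4 follows exactly this shape, with Apache Tika supplying the file contents.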
Project Requirements
The requirements of this filesystem search are as follows:
- An indexed and searchable system based on a filesystem path
- A keyword search will look at the filename and part of the file contents
- The search results should be able to be faceted on
- The file path (each individual path)
- The size of the file
- The type of the file (based on its extension)
Whilst a simple example, this provides insight into how to programmatically index data and surface it through the Panl server.
Project Steps
- Initialise a new project
- Download and configure the Solr Server
- Download and configure the Panl Server
- Write the indexing code
- Run the servers
- Search and facet on the results
1. Initialise a New Project
A new project was created named panl-filesystem which you can clone from GitHub: https://github.com/synapticloop/panl-filesystem - note that the main branch is the starter branch, with an empty configuration.
The following directories were created (all relative from the project root):
- ./servers - to hold the downloaded versions of the Panl and Solr servers (Contents of this directory will be ignored by Git)
- ./config/panl - to hold the configuration items for Panl
- ./config/solr - to hold the configuration items for Solr
For the indexing of the documents, we require some dependencies:
- Apache Tika - this library has an easy interface for extracting the text of a wide range of document formats
- SolrJ Connector - to connect with the Solr instance to programmatically index the individual documents
- Apache Commons IO - to recursively list the files in a directory
Consequently the following lines were added to the build.gradle dependencies section:
dependencies {
    implementation 'org.apache.solr:solr-solrj:9.8.0'
    implementation 'org.apache.tika:tika-core:3.0.0'
    implementation 'org.apache.tika:tika-parsers:3.0.0'
    implementation 'org.apache.tika:tika-parsers-standard-package:3.0.0'
    implementation 'commons-io:commons-io:2.16.1'
}
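As a quick taste of why Tika earns its place here, extracting the text of almost any document type is essentially a one-liner (a hedged sketch, not part of the project code; README.md is a placeholder path):

import org.apache.tika.Tika;

import java.io.File;

public class TikaTaste {
    public static void main(String[] args) throws Exception {
        // Tika detects the file type and chooses the appropriate parser
        String contents = new Tika().parseToString(new File("README.md"));
        System.out.println(contents);
    }
}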
Finally, the application plugin was added to the plugins block of the build.gradle file so that the project can be run as an application:
plugins {
    id 'java'
    id 'application'
}
With the appropriate application configuration:
application {
    mainClass = 'com.synapticloop.Main'
}
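Note that the build will not succeed until the Main class exists; if you want to run the project before step 4, a placeholder such as the following (a stub, to be replaced by the indexing code) will do:

package com.synapticloop;

public class Main {
    public static void main(String[] args) {
        // placeholder - the indexing code is written in step 4
        System.out.println("panl-filesystem indexer - not yet implemented");
    }
}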
2. Download and Configure the Solr Server
To get the Solr server set up and running:
- Download the Solr Server
- Copy the Solr Configuration
- Edit the Managed Schema file
- Start the Solr Server
- Create the Solr Collection
a. Download the Solr Server
Download the latest version from https://solr.apache.org/downloads.html (this example is using the Solr 9.8.0-slim version) and extract it to the ./servers directory.
Your final directory structure should be ./servers/solr-9.8.0-slim/
Note that all sub-directories under the ./servers directory are ignored by Git (through the .gitignore file).
b. Copy the Solr Configuration
Copy the entire directory of
./servers/solr-9.8.0-slim/server/solr/configsets/_default/conf
to
./config/solr/
So that the ./config/solr directory and file structure is as follows:
./config/solr/lang/*.txt
./config/solr/managed-schema.xml
./config/solr/protwords.txt
./config/solr/solrconfig.xml
./config/solr/stopwords.txt
./config/solr/synonyms.txt
c. Edit the Managed Schema
The managed-schema.xml file is the one that defines the fields for Solr. Looking back at the project requirements, there are only a few fields that need to be indexed, and the Solr configuration reflects this.
The first item to edit is the schema name which should be changed from:
<schema name="default-config" version="1.7">
To:
<schema name="filesystem" version="1.7">
Next, the field definitions part of the managed-schema.xml file was edited to the following (the rest of the file has been excluded):
01  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
02  <field name="_version_" type="plong" indexed="false" stored="false" docValues="true"/>
03
04  <field name="filename" type="string" indexed="true" stored="true"/>
05  <field name="filetype" type="string" indexed="true" stored="true"/>
06  <field name="filesize" type="string" indexed="true" stored="true"/>
07  <field name="category" type="string" indexed="true" stored="true" multiValued="true"/>
08  <field name="contents" type="text_general" indexed="true" stored="false"/>
09
10  <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
11
12
13  <uniqueKey>id</uniqueKey>
14
15  <copyField source="filename" dest="_text_"/>
16  <copyField source="contents" dest="_text_"/>
Line 1:
This is the unique key for each indexed document - for this example we are using the full file path. See Line 13 for the Solr XML element that configures the id to use for the collection.
Line 2:
The _version_ field is used for cloud deployments.
Line 4:
The name of the file - note that this is also copied to the keyword-searchable field by the configuration on Line 15.
Line 5:
The type of the file (i.e. the file extension)
Line 6:
The size of the file - which will be derived and stored as a string value, rounded up to the nearest 500 of the appropriate unit - e.g. 500 KB, 1000 KB, 500 MB, etc.
Line 7:
The category of the file (i.e. each of the individual file paths are added to the document as a 'category')
Line 8:
The text contents of the file, which can be searched against by keyword. Note that this field is not stored, as the contents of individual files can be too large to be placed into Solr. (There is a way around this by using a streaming parser for indexing; an example of this is not included in this book.)
Line 10:
This is the holding field for all content that will be analysed for keyword searches.
Note: In the most recent versions of Solr, the default search field (and hence this Solr field name) is '_text_', NOT 'text'.
Line 13:
This defines the unique document key (see Line 1)
Line 15:
Copies the filename to the text index
Line 16:
Copies the content of the file to the text index
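Once documents are indexed, the effect of these copyField rules is that a single keyword query against _text_ matches on both filenames and file contents. A hedged SolrJ sketch of such a query (assuming the servers set up in the following steps are running):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QuerySketch {
    public static void main(String[] args) throws Exception {
        try (Http2SolrClient client = new Http2SolrClient.Builder(
                "http://localhost:8983/solr/filesystem").build()) {
            // 'gradle' matches documents whose filename OR contents contain
            // the term, because both are copied into _text_
            QueryResponse response = client.query(new SolrQuery("_text_:gradle"));
            System.out.println("Found " + response.getResults().getNumFound());
        }
    }
}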
d. Start the Solr Server
The commands to start the Solr server in cloud mode are explained in detail in earlier chapters; here are the brief commands:
Windows
Command(s):
servers\solr-9.8.0-slim\bin\solr.cmd start -e cloud --no-prompt
*NIX
Command(s):
servers/solr-9.8.0-slim/bin/solr start -e cloud --no-prompt |
e. Create the Solr Collection
The commands to create the collection in Solr are explained in detail in earlier chapters; here are the brief commands:
Windows
Command(s):
servers\solr-9.8.0-slim\bin\solr.cmd create -c filesystem -d config\solr --shards 2 -rf 2
*NIX
Command(s):
servers/solr-9.8.0-slim/bin/solr create -c filesystem -d config/solr --shards 2 -rf 2
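If you want to confirm that the collection exists before moving on, the Solr Collections API can list all collections (an optional check, not part of the original steps):

curl "http://localhost:8983/solr/admin/collections?action=LIST"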
Now that the Solr collection is configured, we can move on to setting up the Panl Server.
3. Download and Configure the Panl Server
a. Download the Panl Server
Download the latest version from https://github.com/synapticloop/panl/releases/ (this example is using the Panl 9-2.0.0 version) and extract it to the ./servers directory.
Your final directory structure should be ./servers/solr-panl-9-2.0.0/
Note that all directories under the ./servers directory are ignored by Git (through the .gitignore file).
b. Configure the panl.properties Files
For ease of implementation, the Panl generator will be used.
Windows
Command(s):
servers\solr-panl-9-2.0.0\bin\panl.bat generate -properties config\panl\panl.properties -schema config\solr\managed-schema.xml
*NIX
Command(s):
servers/solr-panl-9-2.0.0/bin/panl generate -properties config/panl/panl.properties -schema config/solr/managed-schema.xml
This generates the following two files:
./config/panl/filesystem.panl.properties
./config/panl/panl.properties
c. Edit the Configuration files
The generator does a good job of getting things set up quickly; however, there are a few tweaks that we want to make to the configuration:
- The panl.facet.t=_text_ property should not be a facet - it should be a field, so we will change this to panl.field.t=_text_
- We will remove the LPSE code 't' (i.e. the _text_ field) from:
- the panl.lpse.order property - we don't want to facet on it
- the panl.results.fields.default property - we don't want the text to be returned with the results documents
- We will also remove the panl.results.fields.firstfive property - we will just use the default FieldSet
- We will update the panl.sort.fields property to include filename and filetype as sorting options
- The panl.facet.i=id property should not be a facet - it should be a field, so we will change this to panl.field.i=id
- We will remove the LPSE code 'i' (i.e. the id field) from:
- the panl.lpse.order property - we don't want to facet on it
- Unlike the _text_ field, we do want this value in the fields returned with the document, so we will not be altering the panl.results.fields.default property
- The panl.facet.f=filename property should not be a facet - it should be a field, so we will change this to panl.field.f=filename
- We will remove the LPSE code 'f' (i.e. the filename field) from:
- the panl.lpse.order property - we don't want to facet on it
- Like the id field, we want this value in the fields returned with the document, so we will not be altering the panl.results.fields.default property
Apart from that, we are good to go.
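For orientation, after these edits the relevant lines of the properties files might look something like the following (a hedged sketch - the property names come from the list above, but the exact separators and the surrounding generated properties depend on your generator output):

panl.field.t=_text_
panl.field.i=id
panl.field.f=filename
panl.sort.fields=filename,filetype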
d. (Optionally) Start the Panl Server
If you want, you can start up the Panl server, however, as there are no documents, it is going to be an empty (and therefore useless) page.
Windows
Command(s):
servers\solr-panl-9-2.0.0\bin\panl.bat server -properties config\panl\panl.properties
*NIX
Command(s):
servers/solr-panl-9-2.0.0/bin/panl server -properties config/panl/panl.properties
If you open your browser to the following URL:
http://localhost:8181/panl-results-viewer/filesystem/default
Image: The In-Built Panl Results Viewer web app showing no results - as we haven't indexed any documents yet.
4. Write the Indexing Code
Now that we have both Solr and Panl set up, it is time to dive into the code. This will be fairly simple; it will:
- Take a single command line argument: the base path from which to start indexing
- Recursively go through each of the found files and add their information to the Solr index
- The indexed information is:
- The file name
- The file type (i.e. the extension)
- The file size (which is rounded up to the nearest 500 in units of Bytes, KB, MB, GB, or TB)
- The categories - these are based on the file path (i.e. the directory or folder structure)
- The contents - we will use Apache Tika to extract the contents
- Ignore any files or directories that start with a '.' character, and any files in the 'build' or 'gradle' directories
The code comes in at around 170 lines (including comments) and can be found on the completed branch of the project: https://github.com/synapticloop/panl-filesystem/tree/completed
It is deliberately simplistic code, just to show how to index documents in Solr; many extensions and speedups could be made to improve it - one such speedup is sketched below.
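For example, the indexer below creates a new client and commits once per file; a batched variant would reuse a single client, buffer documents, and commit once at the end. A hedged sketch of that shape (buildDocument here is a simplified stand-in for the field logic in indexDocument below):

import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class BatchSketch {
    static void indexAll(Collection<File> files) throws Exception {
        try (CloudSolrClient client = new CloudHttp2SolrClient.Builder(
                List.of("http://localhost:8983/solr/")).build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (File file : files) {
                batch.add(buildDocument(file));
                if (batch.size() >= 100) {           // flush every 100 documents
                    client.add("filesystem", batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add("filesystem", batch);
            }
            client.commit("filesystem");             // a single commit at the end
        }
    }

    static SolrInputDocument buildDocument(File file) {
        // simplified for the sketch - the real code adds all of the fields
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getAbsolutePath());
        doc.addField("filename", file.getName());
        return doc;
    }
}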
The logic of the code is:
- Check that a directory path has been passed through as a command line argument
- Recursively find the files under that directory
- For each file, extract the contents with Apache Tika and send the results to the Solr server
The text of the file is included below for reference:
package com.synapticloop;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.IOFileFilter;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.*;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Main {
    // these are the directories to ignore
    private static final Set<String> IGNORE_DIRECTORIES = new HashSet<>();

    static {
        IGNORE_DIRECTORIES.add("build");
        IGNORE_DIRECTORIES.add("gradle");
    }

    // This is the file filter that is used for both files and directories
    private static final IOFileFilter FILE_FILTER = new IOFileFilter() {
        @Override
        public boolean accept(File file) {
            return (accept(file, file.getName()));
        }

        @Override
        public boolean accept(File file, String fileName) {
            if (IGNORE_DIRECTORIES.contains(fileName)) {
                return (false);
            }

            return (!fileName.startsWith("."));
        }
    };

    public static void main(String[] args) {
        if (args.length == 0) {
            throw new RuntimeException(
                    "Expecting one argument of the base directory to index.");
        }

        // try and find the passed in directory
        File baseDir = new File(args[0]);
        if (!baseDir.exists()) {
            throw new RuntimeException(
                    "Base directory " + args[0] + " does not exist.");
        }

        // at this point we are good to index files
        for (File listFile : FileUtils.listFiles(baseDir, FILE_FILTER, FILE_FILTER)) {
            indexDocument(baseDir, listFile);
        }
    }

    /**
     * <p>This does the heavy lifting of indexing the documents with Apache Tika,
     * then connecting to the Solr server to add the contents of the document and
     * metadata to the search collection index.</p>
     *
     * @param baseDir The base directory for starting the search indexing
     * @param listFile The file to be indexed
     */
    private static void indexDocument(File baseDir, File listFile) {
        // get the SolrJ client connection to Solr
        CloudSolrClient client = new CloudHttp2SolrClient.Builder(
                List.of("http://localhost:8983/solr/")).build();

        // Extract the information that we will be using for indexing
        String absolutePath = listFile.getAbsolutePath();
        String filePath = absolutePath.substring(
                baseDir.getAbsolutePath().length(),
                absolutePath.lastIndexOf(File.separator) + 1);
        String fileName = listFile.getName();

        String id = filePath + fileName;
        String fileType = fileName.substring(fileName.lastIndexOf(".") + 1);

        // the following is done because the Windows file separator is a backslash
        // '\' which interferes with the regex parsing on Windows file systems
        // only.
        String[] categories = filePath.split(
                (File.separator.equals("\\") ? "\\\\" : File.separator));

        // a nicety to put them in the root directory
        if (categories.length == 0) {
            categories = new String[]{"ROOT_DIRECTORY"};
        }

        try {
            // get the contents automatically with the Tika parsing
            String contents = new Tika().parseToString(listFile);

            // create the solr document that is going to be indexed
            SolrInputDocument doc = new SolrInputDocument();

            // Add the fields to the document, the first parameter of the call is
            // the Solr field name - which must match the schema
            doc.addField("id", id);
            doc.addField("filename", fileName);
            doc.addField("filetype", fileType);

            // now for the filesize
            doc.addField("filesize", getFileSize(listFile.length()));

            doc.addField("contents", contents);
            doc.addField("category", categories);

            // now we add the document to the collection "filesystem", which must
            // match the collection that was defined in Solr
            client.add("filesystem", doc);

            // now commit the changes to the filesystem Solr collection
            client.commit("filesystem");

            System.out.println("Indexed file " + listFile.getAbsolutePath());
        } catch (IOException | TikaException | SolrServerException e) {
            // something went wrong - we will ignore this file
            System.out.println(
                    "Could not index file " + listFile.getAbsolutePath() +
                            ", message was: " + e.getMessage());
        }

        try {
            // don't forget to close the client
            client.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private static String getFileSize(long length) {
        if (length <= 0) return "0 Bytes";

        String[] units = new String[]{"Bytes", "KB", "MB", "GB", "TB"};
        int unitIndex = 0;

        double fileSize = (double) length;

        while (fileSize >= 1024 && unitIndex < units.length - 1) {
            fileSize /= 1024;
            unitIndex++;
        }

        return String.format(
                "%d %s", roundUpToNearest500(fileSize), units[unitIndex]);
    }

    // rounds the value up to the nearest 500
    public static int roundUpToNearest500(double value) {
        return (int) (Math.ceil(value / 500) * 500);
    }
}
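To make the size bucketing concrete, here are a few values traced by hand through getFileSize (shown as comments, since the method is private):

// getFileSize(300L)         -> "500 Bytes"  (300 rounds up to 500)
// getFileSize(614_400L)     -> "1000 KB"    (600 KB rounds up to 1000)
// getFileSize(10_485_760L)  -> "500 MB"     (10 MB rounds up to 500)
// getFileSize(0L)           -> "0 Bytes"    (the guard clause)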
5. Run the Servers
You may either run the project through your IDE of choice (don't forget to configure the command line argument to be the path that you want to index), or you can build and run it with Gradle.
Windows
Command(s):
gradlew.bat assemble
*NIX
Command(s):
./gradlew assemble |
Extract the distributable from the build/distributions directory and execute the application with a command line parameter of the filesystem path to index.
For example, after extracting build/distributions/panl-filesystem-1.0.0.zip into the build/distributions directory, you will be able to run the following command:
Windows
Command(s):
build\distributions\panl-filesystem-1.0.0\bin\panl-filesystem c:\Users\Synapticloop\Projects\panl\
*NIX
Command(s):
build/distributions/panl-filesystem-1.0.0/bin/panl-filesystem /Users/Synapticloop/Projects/panl/
6. Search and Facet on the Results
If you haven't already started up the Panl server, you may start the server with the following command.
Windows
Command(s):
servers\solr-panl-9-2.0.0\bin\panl.bat server -properties config\panl\panl.properties
*NIX
Command(s):
servers/solr-panl-9-2.0.0/bin/panl server -properties config/panl/panl.properties
Navigate to:
http://localhost:8181/panl-results-viewer/filesystem/default
And search and facet on the results.
Image: The In-Built Panl Results Viewer web app showing the filesystem results.
Extending The Project
The project was set up as an easy way to walk through the indexing of a filesystem and its files. It was deliberately kept simplistic; however, there are plenty of extensions that could be made, including deleting a file from the index (where that file has been deleted from the filesystem, or has been moved to a new location) and adding more data to the index (e.g. file creation timestamp, last modified timestamp, whether the file is text or binary, the size of the file as a range, etc.). A sketch of the first of these follows.
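As a starting point for that first extension, SolrJ can remove a document by its unique key - a hedged sketch, reusing the id convention (relative path + filename) from the indexing code; the path shown is hypothetical:

import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

import java.util.List;

public class DeleteSketch {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudHttp2SolrClient.Builder(
                List.of("http://localhost:8983/solr/")).build()) {
            // remove the document whose backing file no longer exists on disk
            client.deleteById("filesystem", "/src/main/java/com/synapticloop/Old.java");
            client.commit("filesystem");
        }
    }
}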
The starter project (and the completed one) is a good base to start your programmatic indexing journey.
~ ~ ~ * ~ ~ ~