Hi Arturas, 

Here are some things to try:

1) Wrap the reader in a BufferedReader only if it does not support mark/reset:

   HTMLStripCharFilter stripper = new HTMLStripCharFilter(
       strReader.markSupported() ? strReader : new BufferedReader(strReader));

2) Consider using HTMLStripFieldUpdateProcessorFactory, which strips the HTML on the Solr side as part of the update chain.

3) Create a custom Lucene analyzer using HTMLStripCharFilter and WhitespaceTokenizer. Use the "Invoking the Analyzer" example given in
http://lucene.apache.org/core/7_4_0/core/org/apache/lucene/analysis/package-summary.html
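
To illustrate suggestion 1, here is a rough sketch that uses only the JDK, so it runs without any Solr or Lucene jars. The class name ReaderDrainSketch and the helpers ensureMarkSupported and drain are made up for illustration. In the real program you would pass the reader returned by ensureMarkSupported to the HTMLStripCharFilter constructor and drain the filter instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; not part of Solr or Lucene.
class ReaderDrainSketch {

    // Wrap the reader only when it cannot mark/reset on its own.
    static Reader ensureMarkSupported(Reader in) {
        return in.markSupported() ? in : new BufferedReader(in);
    }

    // Read everything the reader produces into a String, then close it.
    static String drain(Reader in) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader r = in) {
            int n;
            while ((n = r.read(buf)) != -1) { // -1 marks the end of the stream
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Reader source = ensureMarkSupported(new StringReader("<p>Hello <b>Solr</b></p>"));
        // Assumption: in the real program this would be
        // drain(new HTMLStripCharFilter(source)); here the input is echoed unchanged.
        System.out.println(drain(source));
    }
}
```

Since StringReader already supports mark/reset, the extra BufferedReader around it is not needed; the check only matters for readers that do not support it.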

Ahmet



On Thursday, July 5, 2018, 9:53:58 AM GMT+3, Arturas Mazeika <maze...@gmail.com> wrote:

Hi Solr Folk,

What would be the easiest way to use some of the Solr and Lucene components
in SolrJ?

I am pretty amazed by how much thought and careful engineering went into some
individual components to handle the messy real world effectively, and I
wonder whether one could reuse some of them in other contexts.

Ultimately, I wanted to strip the HTML markup and store the output in Solr
(for various reasons [0]). I approached the problem pragmatically: I googled
"HTMLStripCharFilter and example", got to [1], checked which jar I needed for
that (solr-core), googled for the pom dependencies [2], and integrated this
into my SolrJ app:

    StringReader strReader = new StringReader(content);
    HTMLStripCharFilter stripper = new HTMLStripCharFilter(new BufferedReader(strReader));
    StringBuilder o = new StringBuilder();
    char[] cbuf = new char[1024 * 10];
    while (true) {
        int count = stripper.read(cbuf);
        if (count == -1)
            break; // -1 marks the end of the stream
        if (count > 0)
            o.append(cbuf, 0, count);
    }
    stripper.close();
    doc.addField("content_stripped", o.toString());


The dependencies were downloaded [3], but when I start the program nothing
happens (I suspect that a web server is being started).

Comments?

Cheers,
Arturas

References

[0] The reasons range from optimizing text highlighting for the end user, to
getting to know individual Solr components at the deepest level, to analyzing
the impact on algorithms such as machine learning or data management.

[1]
https://www.programcreek.com/java-api-examples/index.php?api=org.apache.lucene.analysis.charfilter.HTMLStripCharFilter

[2] pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-solrj</artifactId>
            <version>7.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
            <version>7.3.0</version>
        </dependency>
    </dependencies>

[3] Included jars:
hppc-0.7.3.jar already exists in destination.
jackson-annotations-2.5.4.jar already exists in destination.
jackson-core-2.5.4.jar already exists in destination.
jackson-databind-2.5.4.jar already exists in destination.
jackson-dataformat-smile-2.5.4.jar already exists in destination.
caffeine-2.4.0.jar already exists in destination.
guava-14.0.1.jar already exists in destination.
protobuf-java-3.1.0.jar already exists in destination.
t-digest-3.1.jar already exists in destination.
commons-cli-1.2.jar already exists in destination.
commons-codec-1.10.jar already exists in destination.
commons-collections-3.2.2.jar already exists in destination.
commons-configuration-1.6.jar already exists in destination.
commons-fileupload-1.3.2.jar already exists in destination.
commons-io-2.5.jar already exists in destination.
commons-lang-2.6.jar already exists in destination.
dom4j-1.6.1.jar already exists in destination.
gmetric4j-1.0.7.jar already exists in destination.
metrics-core-3.2.2.jar already exists in destination.
metrics-ganglia-3.2.2.jar already exists in destination.
metrics-graphite-3.2.2.jar already exists in destination.
metrics-jetty9-3.2.2.jar already exists in destination.
metrics-jvm-3.2.2.jar already exists in destination.
javax.servlet-api-3.1.0.jar already exists in destination.
tools.jar already exists in destination.
joda-time-2.2.jar already exists in destination.
log4j-1.2.17.jar already exists in destination.
eigenbase-properties-1.1.5.jar already exists in destination.
antlr4-runtime-4.5.1-1.jar already exists in destination.
calcite-core-1.13.0.jar already exists in destination.
calcite-linq4j-1.13.0.jar already exists in destination.
avatica-core-1.10.0.jar already exists in destination.
commons-exec-1.3.jar already exists in destination.
commons-lang3-3.6.jar already exists in destination.
commons-math3-3.6.1.jar already exists in destination.
curator-client-2.8.0.jar already exists in destination.
curator-framework-2.8.0.jar already exists in destination.
curator-recipes-2.8.0.jar already exists in destination.
hadoop-annotations-2.7.4.jar already exists in destination.
hadoop-auth-2.7.4.jar already exists in destination.
hadoop-common-2.7.4.jar already exists in destination.
hadoop-hdfs-2.7.4.jar already exists in destination.
htrace-core-3.2.0-incubating.jar already exists in destination.
httpclient-4.5.3.jar already exists in destination.
httpcore-4.4.6.jar already exists in destination.
httpmime-4.5.3.jar already exists in destination.
lucene-analyzers-common-7.3.0.jar already exists in destination.
lucene-analyzers-kuromoji-7.3.0.jar already exists in destination.
lucene-analyzers-phonetic-7.3.0.jar already exists in destination.
lucene-backward-codecs-7.3.0.jar already exists in destination.
lucene-classification-7.3.0.jar already exists in destination.
lucene-codecs-7.3.0.jar already exists in destination.
lucene-core-7.3.0.jar already exists in destination.
lucene-expressions-7.3.0.jar already exists in destination.
lucene-grouping-7.3.0.jar already exists in destination.
lucene-highlighter-7.3.0.jar already exists in destination.
lucene-join-7.3.0.jar already exists in destination.
lucene-memory-7.3.0.jar already exists in destination.
lucene-misc-7.3.0.jar already exists in destination.
lucene-queries-7.3.0.jar already exists in destination.
lucene-queryparser-7.3.0.jar already exists in destination.
lucene-sandbox-7.3.0.jar already exists in destination.
lucene-spatial-extras-7.3.0.jar already exists in destination.
lucene-spatial3d-7.3.0.jar already exists in destination.
lucene-suggest-7.3.0.jar already exists in destination.
solr-core-7.3.0.jar already exists in destination.
solr-solrj-7.3.0.jar already exists in destination.
zookeeper-3.4.11.jar already exists in destination.
jackson-core-asl-1.9.13.jar already exists in destination.
jackson-mapper-asl-1.9.13.jar already exists in destination.
commons-compiler-2.7.6.jar already exists in destination.
janino-2.7.6.jar already exists in destination.
stax2-api-3.1.4.jar already exists in destination.
woodstox-core-asl-4.4.1.jar already exists in destination.
jetty-continuation-9.4.8.v20171121.jar already exists in destination.
jetty-deploy-9.4.8.v20171121.jar already exists in destination.
jetty-http-9.4.8.v20171121.jar already exists in destination.
jetty-io-9.4.8.v20171121.jar already exists in destination.
jetty-jmx-9.4.8.v20171121.jar already exists in destination.
jetty-rewrite-9.4.8.v20171121.jar already exists in destination.
jetty-security-9.4.8.v20171121.jar already exists in destination.
jetty-server-9.4.8.v20171121.jar already exists in destination.
jetty-servlet-9.4.8.v20171121.jar already exists in destination.
jetty-servlets-9.4.8.v20171121.jar already exists in destination.
jetty-util-9.4.8.v20171121.jar already exists in destination.
jetty-webapp-9.4.8.v20171121.jar already exists in destination.
jetty-xml-9.4.8.v20171121.jar already exists in destination.
spatial4j-0.7.jar already exists in destination.
noggit-0.8.jar already exists in destination.
asm-5.1.jar already exists in destination.
asm-commons-5.1.jar already exists in destination.
org.restlet-2.3.0.jar already exists in destination.
org.restlet.ext.servlet-2.3.0.jar already exists in destination.
jcl-over-slf4j-1.7.24.jar already exists in destination.
slf4j-api-1.7.24.jar already exists in destination.
