Re: Default Index config

2018-04-09 Thread mganeshs
Hi Shawn,

Thanks for the reply. 

Yes, we use only one Solr client. Though the collection name is passed to the
function, we are using the same client for now.

Regarding the merge config: after reading a lot of forum posts and listening to
a Revolution 2017 presentation, the idea is to reduce the merge frequency, so
that the CPU usage pattern comes down from 100% to around 70% most of the time
and only goes back to 100% when merges happen. Right now it is always above 95%,
which we see as a bad sign since we also run other components on this server.
That is why I was trying those merge settings.

Thanks for sharing your config; I will try that as well and post an update on
the results.

Thanks and Regards,




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Text in images are not extracted and indexed to content

2018-04-09 Thread Zheng Lin Edwin Yeo
Hi,

Currently I am facing an issue whereby the text in image files like jpg and bmp
is not being extracted and indexed. After the indexing, Tika did extract all the
metadata and index it under the attr_* fields. However, the content field is
always empty for image files. For other types of document files like .doc, the
content is extracted correctly.

I have already updated tika-parsers-1.17.jar, under
\prg\apache\tika\parser\pdf\, to set extractInlineImages to true.
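As a sanity check, a small sketch like the one below (the file path and having
tika-core / tika-parsers 1.17 on the classpath are assumptions) shows what Tika
alone extracts from one of these images, outside Solr:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ImageExtractCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical path: one of the .jpg files that index with empty content
        Path image = Paths.get(args[0]);
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(image)) {
            new AutoDetectParser().parse(in, handler, metadata);
        }
        System.out.println("Detected type:       " + metadata.get(Metadata.CONTENT_TYPE));
        System.out.println("Extracted text size: " + handler.toString().length());
    }
}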


What could be the reason?

I have just upgraded to Solr 7.3.0.

Regards,
Edwin


Re: Confusing error when creating a new core with TLS, service enabled

2018-04-09 Thread Shawn Heisey
On 4/9/2018 12:58 PM, Christopher Schultz wrote:
> After playing-around with a Solr 7.2.1 instance launched from the
> extracted tarball, I decided to go ahead and create a "real service" on
> my Debian-based server.
>
> I've run the 7.3.0 install script, configured Solr for TLS, and moved my
> existing configuration into the data directory, here:

What was the *precise* command you used to install Solr?  Looking for
all the options you used, so I know where things are.  There shouldn't
be anything sensitive in that command, so I don't think you need to
redact it at all.  Also, what exactly did you add to
/etc/default/solr.in.sh?  Redact any passwords you put there if you need to.

> When trying to create a new core, I get an NPE running:
>
> $ /usr/local/solr/bin/solr create -V -c new_core
>
> WARNING: Using _default configset with data driven schema functionality.
> NOT RECOMMENDED for production use.
>  To turn off: bin/solr config -c new_core -p 8983 -property
> update.autoCreateFields -value false
> Exception in thread "main" java.lang.NullPointerException
>   at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:731)
>   at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:642)
>   at org.apache.solr.util.SolrCLI$CreateTool.runImpl(SolrCLI.java:1773)
>   at org.apache.solr.util.SolrCLI$ToolBase.runTool(SolrCLI.java:176)
>   at org.apache.solr.util.SolrCLI.main(SolrCLI.java:282)

Due to the way the code is written there in version 7.3, the exact
nature of the problem is lost and it's not possible to see it without a
change to the source code.  If you want to build a patched version of
7.3, you could re-run it to see exactly what happened.  Here's an issue
for the NPE problem:

https://issues.apache.org/jira/browse/SOLR-12206

Best guess about the error that it got:  When you ran the create
command, I think that Java was not able to validate the SSL certificate
from the Solr server.  This would be consistent with what I saw in the
source code.
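One quick way to test that guess, independent of bin/solr, is a bare HTTPS
request from the same JVM/truststore setup; a minimal sketch (the URL is an
assumption) is below. An untrusted certificate shows up as an
SSLHandshakeException instead of a status code.

import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class CheckSolrTls {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; adjust host/port to match the installed service.
        URL url = new URL("https://localhost:8983/solr/admin/info/system");
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        // If the default trust store does not trust Solr's certificate, this
        // call throws javax.net.ssl.SSLHandshakeException before returning.
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}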

For the problem you had later with "-force" ... this is *exactly* why
you shouldn't run bin/solr as root.  What happened is that the new core
directory was created as root, owned by root.  But then when Solr tried
to add the core, it needed to write a core.properties file to that
directory, but was not able to do so, probably because it's running as
"solr" and has no write permission in a directory owned by root.

The error in the message from the command with "-force" seems to have
schizophrenia.  It says it's trying to create a core named
"cshultz_patients" but the error mentions
/var/solr/data/new_core/core.properties ... which should only happen if
the core is named "new_core".  If you're going to redact log messages,
please be sure to do it in an entirely consistent manner.  If you didn't
edit that log, then that's very strange.

Thanks,
Shawn



Re: replication

2018-04-09 Thread John Blythe
Thanks a bunch for the thorough reply, Shawn.

Phew. We'd chosen to go with master-slave replication instead of SolrCloud
because of the sudden need we had encountered and the desire to avoid the
nuances and changes involved in moving to SolrCloud. But so much for this being
a more straightforward solution, huh?

A few questions:
- should we try to bite the SolrCloud bullet and be done with it?
- is there some more config work we could put in place to avoid the soft
commit issue and the ultra-large merge dangers, while keeping replication
happening quickly?
- maybe for our initial need we use the master for writing and NRT user
access, but the slaves for the heavier backend processing. Thoughts?
- does anyone do consulting on this who would be interested in chatting?

Thanks again!

On Mon, Apr 9, 2018 at 18:18 Shawn Heisey  wrote:

> On 4/9/2018 12:15 PM, John Blythe wrote:
> > we're starting to dive into master/slave replication architecture. we'll
> > have 1 master w 4 slaves behind it. our app is NRT. if user performs an
> > action in section A's data they may choose to jump to section B which
> will
> > be dependent on having the updates from their action in section A. as
> such,
> > we're thinking that the replication time should be set to 1-2s (the
> chances
> > of them arriving at section B quickly enough to catch the 2s gap is
> highly
> > unlikely at best).
>
> Once you start talking about master-slave replication, my assumption is
> that you're not running SolrCloud.  You would NOT want to try and mix
> SolrCloud with replication.  The features do not play well together.
> SolrCloud with NRT replicas (this is the only replica type that exists
> in 6.x and earlier) may be a better option than master-slave replication.
>
> > since the replicas will simply be looking for new files it seems like
> this
> > would be a lightweight operation even every couple seconds for 4
> replicas.
> > that said, i'm going *entirely* off of assumption at this point and
> wanted
> > to check in w you all to see any nuances, gotchas, hidden landmines, etc.
> > that we should be considering before rolling things out.
>
> Most of the time, you'd be correct to think that indexing is going to
> create a new small segment and replication will have little work to do.
> But as you create more and more segments, eventually Lucene is going to
> start merging those segments.  For discussion purposes, I'm going to
> describe a situation where each new segment during indexing is about
> 100KB in size, and the merge policy is left at the default settings.
> I'm also going to assume that no documents are getting deleted or
> reindexed (which will delete the old version).  Deleted documents can
> have an impact on merging, but it will usually only be a dramatic impact
> if there are a LOT of deleted documents.
>
> The first ten segments created will be this 100KB size.  Then Lucene is
> going to see that there are enough segments to trigger the merge policy
> - it's going to combine ten of those segments into one that's
> approximately one megabyte.  Repeat this ten times, and ten of those 1
> megabyte segments will be combined into one ten megabyte segment.
> Repeat all of THAT ten times, and there will be a 100 megabyte segment.
> And there will eventually be another level creating 1 gigabyte
> segments.  If the index is below 5GB in size, the entire thing *could*
> be merged into one segment by this process.
>
> The end result of all this:  Replication is not always going to be
> super-quick.  If merging creates a 1 gigabyte segment, then the amount
> of time to transfer that new segment is going to depend on how fast your
> disks are, and how fast your network is.  If you're using commodity SATA
> drives in the 4 to 10 terabyte range and a gigabit network, the network
> is probably going to be the bottleneck -- assuming that the system has
> plenty of memory and isn't under a high load.  If the network is the
> bottleneck in that situation, it's probably going to take close to ten
> seconds to transfer a 1GB segment, and the greater part of a minute to
> transfer a 5GB segment, which is the biggest one that the default merge
> policy configuration will create without an optimize operation.
>
> Also, you should understand something that has come to my attention
> recently (and is backed up by documentation):  If the master does a soft
> commit and the segment that was committed remains in memory (not flushed
> to disk), that segment will NOT be replicated to the slaves.  It has to
> get flushed to disk before it can be replicated.
>
> Thanks,
> Shawn
>
> --
John Blythe


Score certain documents higher based on a weight field

2018-04-09 Thread OTH
Hello,

Is there a way to assign a higher score to certain documents based on a
'weight' field?


E.g., if I have the following two documents:
{
  "name":"United Kingdom",
  "weight":2730
},
{
  "name":"United States of America",
  "weight":11246
}

Currently, if I issue the following query:
q=name:united

These are the scores I get:
{
  "name":"United Kingdom",
  "weight":2730,
  "score":9.464103
},
{
  "name":"United States of America",
  "weight":11246,
  "score":7.766276
}


However, I'd like the score to somehow factor in the number in the "weight"
column.  (And hence, increase the score assigned to "United States of
America" in this case.)

Much thanks


Re: this IndexWriter is closed

2018-04-09 Thread Shawn Heisey
On 4/9/2018 12:31 PM, Jay Potharaju wrote:
> I am getting an "IndexWriter is closed" error only on some of my shards in the
> collection. This seems to be happening on leader shards only. There are other
> shards on the box and they are not throwing any errors. Also there is enough
> disk space available on the box at this time.

> Caused by: java.io.IOException: No space left on device

Lucene (used by Solr) is reporting information given to it by Java,
which got its information from the operating system.  The OS said that
there's no space left.

There are a few possibilities here that I can think of:

1) The user that is running the application has a quota on the storage
and has reached that quota, so that specific user is being told there's
no space, while another user can see lots of space.

2) You've actually run out of some other resource besides literal
storage.  One example is that the storage volume has reached the maximum
number of inodes that it can store.  I don't know if this can result in
the same error message, but  it wouldn't surprise me.

http://blog.scoutapp.com/articles/2014/10/08/understanding-disk-inodes

3) You're looking at the wrong storage volume to see the free space.
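For case 3 (and to see exactly what the JVM running Solr is told about the
volume), something like this small sketch, pointed at the core's data
directory, can help; the path argument is an assumption:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskSpaceCheck {
    public static void main(String[] args) throws Exception {
        // e.g. java DiskSpaceCheck /var/solr/data/yourcore/data
        Path dataDir = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(dataDir);
        System.out.println("Volume:       " + store);
        System.out.println("Total bytes:  " + store.getTotalSpace());
        System.out.println("Usable bytes: " + store.getUsableSpace()); // what this process can actually write
    }
}

Run it as the same user that runs Solr so the numbers are comparable to what
Lucene sees.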

Thanks,
Shawn



RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work.

Thanks!

-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 09, 2018 2:15 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_mattflax_dropwizard-2Dtika-2Dserver=DwIFaQ=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U=RkNfel_ImtzaUi1-fKXjGS0tiL3Vg2u2A2HKc0iMBGM=VrGqjG23NC5KbsEV-SZuu6s-Njx_XZRPp4uHkrmM_KY=
 written by a colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder 
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from 
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__
> lucidworks.com_2012_02_14_indexing-2Dwith-2Dsolrj_=DwIGaQ=jdm1Hby_
> BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M=N30IrhmaeKKhVHu13d-
> HO9gO9CysWnvGGoKrSNEuM3U=7XZTNWKY6A53HuY_2qeWA_
> 3ndvYmpHBHjZXJ5pTMP2w=YbP_o22QJ_tsZDUPgSfDvEXZ9asBUFFHz53s2yTH8Q0=
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and 
> the link to Erick's blog post whenever Tika is used. 
>
>
>
>
>
> -Original Message-
>
> From: Charlie Hull [mailto:char...@flax.co.uk]
>
> Sent: Monday, April 9, 2018 12:44 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML 
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
>
>
> I'd recommend you run Tika externally to Solr, which will allow you to 
> catch this kind of problem and prevent it bringing down your Solr 
> installation.
>
>
>
> Cheers
>
>
>
> Charlie
>
>
>
> On 9 April 2018 at 16:59, Hanjan, Harinder 
> 
>
> wrote:
>
>
>
> > Hello!
>
> >
>
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain 
> > documents
>
> > we have in our Sharepoint system. I have used the tika-app.jar
>
> > directly to extract the document in question and it does _not_ throw
>
> > an exception and extract the contents just fine. So it would seem 
> > Solr
>
> > is doing something different than a Tika standalone installation.
>
> >
>
> > After some Googling, I found out that Solr uses its custom 
> > HtmlMapper
>
> > (MostlyPassthroughHtmlMapper) which passes through all elements in 
> > the
>
> > HTML document to Tika. As Tika limits nested elements to 100, this
>
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>
> > XML element nesting. This is metioned in TIKA-2091
>
> > (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__issues.apache.org_=DwIGaQ=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyK
> Du vdq3M=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U=
> 7XZTNWKY6A53HuY_2qeWA_3ndvYmpHBHjZXJ5pTMP2w=Il6-
> in8tGiAN3MaNlXmqvIkc3VyCCeG2qK2cGyMOuw0= jira/browse/TIKA-2091?
> focusedCommentId=15514131=com.atlassian.jira.
>
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). 
> > The
>
> > "solution" is to use Tika's default parsing/mapping mechanism but no
>
> > details have been provided on how to configure this at Solr.
>
> >
>
> > I'm hoping some folks here have the knowledge on how to configure 
> > Solr
>
> > to effectively by-pass its built in MostlyPassthroughHtmlMapper and
>
> > use Tika's implementation.
>
> >
>
> > Thank you!
>
> > Harinder
>
> >
>
> >
>
> > 
>
> > NOTICE -
>
> > This communication is intended ONLY for the use of the person or
>
> > entity named above and may contain information that is confidential 
> > or
>
> > legally privileged. If you are not the intended recipient named 
> > above
>
> > or a person responsible for delivering messages or communications to
>
> > the intended recipient, YOU ARE HEREBY NOTIFIED that any use,
>
> > distribution, or copying of this communication or any of the
>
> > information contained in it is strictly prohibited. If you have
>
> > received this communication in error, please notify us immediately 
> > by
>
> > telephone and then destroy or delete this communication, or return 
> > it
>
> > to us by mail if requested by us. The City of Calgary thanks you for
> your attention and co-operation.
>
> >
>
>


Re: replication

2018-04-09 Thread Shawn Heisey
On 4/9/2018 12:15 PM, John Blythe wrote:
> we're starting to dive into master/slave replication architecture. we'll
> have 1 master w 4 slaves behind it. our app is NRT. if user performs an
> action in section A's data they may choose to jump to section B which will
> be dependent on having the updates from their action in section A. as such,
> we're thinking that the replication time should be set to 1-2s (the chances
> of them arriving at section B quickly enough to catch the 2s gap is highly
> unlikely at best).

Once you start talking about master-slave replication, my assumption is
that you're not running SolrCloud.  You would NOT want to try and mix
SolrCloud with replication.  The features do not play well together. 
SolrCloud with NRT replicas (this is the only replica type that exists
in 6.x and earlier) may be a better option than master-slave replication.

> since the replicas will simply be looking for new files it seems like this
> would be a lightweight operation even every couple seconds for 4 replicas.
> that said, i'm going *entirely* off of assumption at this point and wanted
> to check in w you all to see any nuances, gotchas, hidden landmines, etc.
> that we should be considering before rolling things out.

Most of the time, you'd be correct to think that indexing is going to
create a new small segment and replication will have little work to do. 
But as you create more and more segments, eventually Lucene is going to
start merging those segments.  For discussion purposes, I'm going to
describe a situation where each new segment during indexing is about
100KB in size, and the merge policy is left at the default settings. 
I'm also going to assume that no documents are getting deleted or
reindexed (which will delete the old version).  Deleted documents can
have an impact on merging, but it will usually only be a dramatic impact
if there are a LOT of deleted documents.

The first ten segments created will be this 100KB size.  Then Lucene is
going to see that there are enough segments to trigger the merge policy
- it's going to combine ten of those segments into one that's
approximately one megabyte.  Repeat this ten times, and ten of those 1
megabyte segments will be combined into one ten megabyte segment. 
Repeat all of THAT ten times, and there will be a 100 megabyte segment. 
And there will eventually be another level creating 1 gigabyte
segments.  If the index is below 5GB in size, the entire thing *could*
be merged into one segment by this process.

The end result of all this:  Replication is not always going to be
super-quick.  If merging creates a 1 gigabyte segment, then the amount
of time to transfer that new segment is going to depend on how fast your
disks are, and how fast your network is.  If you're using commodity SATA
drives in the 4 to 10 terabyte range and a gigabit network, the network
is probably going to be the bottleneck -- assuming that the system has
plenty of memory and isn't under a high load.  If the network is the
bottleneck in that situation, it's probably going to take close to ten
seconds to transfer a 1GB segment, and the greater part of a minute to
transfer a 5GB segment, which is the biggest one that the default merge
policy configuration will create without an optimize operation.
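To make the arithmetic above concrete, here is a rough back-of-the-envelope
sketch (the 100KB flush size and a ~110 MB/s of usable gigabit throughput are
the assumptions from the example):

public class MergeTransferEstimate {
    public static void main(String[] args) {
        double segmentMB = 0.1;      // ~100KB per freshly flushed segment
        int mergeFactor = 10;        // ten segments combined per merge level
        double linkMBps = 110.0;     // rough usable throughput of gigabit Ethernet

        for (int level = 0; level <= 4; level++) {
            double sizeMB = segmentMB * Math.pow(mergeFactor, level);
            double seconds = sizeMB / linkMBps;
            System.out.printf("level %d: segment ~%.1f MB, ~%.1f s to replicate%n",
                    level, sizeMB, seconds);
        }
        // level 4 is the ~1 GB segment: roughly 9 seconds on a gigabit link,
        // in line with the "close to ten seconds" estimate above.
    }
}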

Also, you should understand something that has come to my attention
recently (and is backed up by documentation):  If the master does a soft
commit and the segment that was committed remains in memory (not flushed
to disk), that segment will NOT be replicated to the slaves.  It has to
get flushed to disk before it can be replicated.

Thanks,
Shawn



Recover a Solr Node

2018-04-09 Thread Karthik Ramachandran
We are using SolrCloud with 3 nodes, no replication, with 8 shards per node
per collection. We have multiple collections on that node.

We have a backup of the data folder, so we can recover it. Is there a way to
reconstruct core.properties for all the replicas on that node?
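For reference, a hedged sketch of what a SolrCloud core.properties usually
contains; the names below are illustrative, and the real values have to match
what the cluster state in ZooKeeper expects for each replica:

# hypothetical replica of "mycollection", shard1, on this node
name=mycollection_shard1_replica_n1
collection=mycollection
shard=shard1
coreNodeName=core_node3
# replicaType appears on Solr 7.x cores
replicaType=NRT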

-- 
With Thanks & Regards
Karthik


Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web
service https://github.com/mattflax/dropwizard-tika-server written by a
colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder 
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__
> lucidworks.com_2012_02_14_indexing-2Dwith-2Dsolrj_=DwIGaQ=jdm1Hby_
> BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M=N30IrhmaeKKhVHu13d-
> HO9gO9CysWnvGGoKrSNEuM3U=7XZTNWKY6A53HuY_2qeWA_
> 3ndvYmpHBHjZXJ5pTMP2w=YbP_o22QJ_tsZDUPgSfDvEXZ9asBUFFHz53s2yTH8Q0=
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and the
> link to Erick's blog post whenever Tika is used. 
>
>
>
>
>
> -Original Message-
>
> From: Charlie Hull [mailto:char...@flax.co.uk]
>
> Sent: Monday, April 9, 2018 12:44 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
>
>
> I'd recommend you run Tika externally to Solr, which will allow you to
> catch this kind of problem and prevent it bringing down your Solr
> installation.
>
>
>
> Cheers
>
>
>
> Charlie
>
>
>
> On 9 April 2018 at 16:59, Hanjan, Harinder 
>
> wrote:
>
>
>
> > Hello!
>
> >
>
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>
> > we have in our Sharepoint system. I have used the tika-app.jar
>
> > directly to extract the document in question and it does _not_ throw
>
> > an exception and extract the contents just fine. So it would seem Solr
>
> > is doing something different than a Tika standalone installation.
>
> >
>
> > After some Googling, I found out that Solr uses its custom HtmlMapper
>
> > (MostlyPassthroughHtmlMapper) which passes through all elements in the
>
> > HTML document to Tika. As Tika limits nested elements to 100, this
>
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>
> > XML element nesting. This is metioned in TIKA-2091
>
> > (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__issues.apache.org_=DwIGaQ=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDu
> vdq3M=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U=
> 7XZTNWKY6A53HuY_2qeWA_3ndvYmpHBHjZXJ5pTMP2w=Il6-
> in8tGiAN3MaNlXmqvIkc3VyCCeG2qK2cGyMOuw0= jira/browse/TIKA-2091?
> focusedCommentId=15514131=com.atlassian.jira.
>
> > plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The
>
> > "solution" is to use Tika's default parsing/mapping mechanism but no
>
> > details have been provided on how to configure this at Solr.
>
> >
>
> > I'm hoping some folks here have the knowledge on how to configure Solr
>
> > to effectively by-pass its built in MostlyPassthroughHtmlMapper and
>
> > use Tika's implementation.
>
> >
>
> > Thank you!
>
> > Harinder
>
> >
>
> >
>
> > 
>
> > NOTICE -
>
> > This communication is intended ONLY for the use of the person or
>
> > entity named above and may contain information that is confidential or
>
> > legally privileged. If you are not the intended recipient named above
>
> > or a person responsible for delivering messages or communications to
>
> > the intended recipient, YOU ARE HEREBY NOTIFIED that any use,
>
> > distribution, or copying of this communication or any of the
>
> > information contained in it is strictly prohibited. If you have
>
> > received this communication in error, please notify us immediately by
>
> > telephone and then destroy or delete this communication, or return it
>
> > to us by mail if requested by us. The City of Calgary thanks you for
> your attention and co-operation.
>
> >
>
>


Re: Confusing error when creating a new core with TLS, service enabled

2018-04-09 Thread Christopher Schultz
All,

On 4/9/18 2:58 PM, Christopher Schultz wrote:
> All,
> 
> After playing-around with a Solr 7.2.1 instance launched from the
> extracted tarball, I decided to go ahead and create a "real service" on
> my Debian-based server.
> 
> I've run the 7.3.0 install script, configured Solr for TLS, and moved my
> existing configuration into the data directory, here:
> 
> $ sudo ls -l /var/solr/data
> total 12
> drwxr-xr-x 4 solr solr 4096 Mar  5 15:12 test_core
> -rw-r- 1 solr solr 2117 Apr  9 09:49 solr.xml
> -rw-r- 1 solr solr  975 Apr  9 09:49 zoo.cfg
> 
> I have a single node, no ZK.
> 
> When trying to create a new core, I get an NPE running:
> 
> $ /usr/local/solr/bin/solr create -V -c new_core
> 
> WARNING: Using _default configset with data driven schema functionality.
> NOT RECOMMENDED for production use.
>  To turn off: bin/solr config -c new_core -p 8983 -property
> update.autoCreateFields -value false
> Exception in thread "main" java.lang.NullPointerException
>   at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:731)
>   at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:642)
>   at org.apache.solr.util.SolrCLI$CreateTool.runImpl(SolrCLI.java:1773)
>   at org.apache.solr.util.SolrCLI$ToolBase.runTool(SolrCLI.java:176)
>   at org.apache.solr.util.SolrCLI.main(SolrCLI.java:282)
> 
> 
> There is nothing being printed in the log files.
> 
> I thought it might be because I enabled TLS.
> 
> My /etc/default/solr.in.sh (which was created during installation)
> contains the minor configuration required for TLS, among other obvious
> things such as where my data resides.
> 
> I checked the /usr/local/solr/bin/solr script, and I can see that
> /etc/default/solr.in.sh is indeed checked and run if readable.
> 
> Readable.
> 
> The Solr installer (reasonably) makes all scripts, etc. readable only by
> the Solr user, and I'm never logged-in as Solr, so I can't read this
> file normally. I therefore ended up having to run the command like this:
> 
> $ sudo /usr/local/solr/bin/solr create -V -c new_core

Actually, then I got this error:

WARNING: Creating cores as the root user can cause Solr to fail and is
not advisable. Exiting.
 If you started Solr as root (not advisable either), force core
creation by adding argument -force

When adding "-force" to the command-line, I get an error about not being
able to persist core properties to a directory on the disk, with not
much detail:

2018-04-09 19:03:14.796 ERROR (qtp2114889273-17) [   ]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error
CREATEing SolrCore 'cschultz_patients': Couldn't persist core properties
to /var/solr/data/new_core/core.properties :
/var/solr/data/new_core/core.properties
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:989)
at
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:90)
at
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:358)
at
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:389)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:195)
at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:736)
at
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:498)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:384)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:330)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1629)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
at

Confusing error when creating a new core with TLS, service enabled

2018-04-09 Thread Christopher Schultz
All,

After playing-around with a Solr 7.2.1 instance launched from the
extracted tarball, I decided to go ahead and create a "real service" on
my Debian-based server.

I've run the 7.3.0 install script, configured Solr for TLS, and moved my
existing configuration into the data directory, here:

$ sudo ls -l /var/solr/data
total 12
drwxr-xr-x 4 solr solr 4096 Mar  5 15:12 test_core
-rw-r- 1 solr solr 2117 Apr  9 09:49 solr.xml
-rw-r- 1 solr solr  975 Apr  9 09:49 zoo.cfg

I have a single node, no ZK.

When trying to create a new core, I get an NPE running:

$ /usr/local/solr/bin/solr create -V -c new_core

WARNING: Using _default configset with data driven schema functionality.
NOT RECOMMENDED for production use.
 To turn off: bin/solr config -c new_core -p 8983 -property
update.autoCreateFields -value false
Exception in thread "main" java.lang.NullPointerException
at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:731)
at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:642)
at org.apache.solr.util.SolrCLI$CreateTool.runImpl(SolrCLI.java:1773)
at org.apache.solr.util.SolrCLI$ToolBase.runTool(SolrCLI.java:176)
at org.apache.solr.util.SolrCLI.main(SolrCLI.java:282)


There is nothing being printed in the log files.

I thought it might be because I enabled TLS.

My /etc/default/solr.in.sh (which was created during installation)
contains the minor configuration required for TLS, among other obvious
things such as where my data resides.

I checked the /usr/local/solr/bin/solr script, and I can see that
/etc/default/solr.in.sh is indeed checked and run if readable.

Readable.

The Solr installer (reasonably) makes all scripts, etc. readable only by
the Solr user, and I'm never logged-in as Solr, so I can't read this
file normally. I therefore ended up having to run the command like this:

$ sudo /usr/local/solr/bin/solr create -V -c new_core

This was unexpected, because "everything goes through the web service."
Well, everything except for figuring out how to connect to the web
service, of course.

I think the bin/solr script should maybe dump a message saying
"Can't read file $configfile; might not be able to connect to Solr" or
something. It would have saved me a ton of time.

Thanks,
-chris


this IndexWriter is closed

2018-04-09 Thread Jay Potharaju
Hi,
I am getting an "IndexWriter is closed" error only on some of my shards in the
collection. This seems to be happening on leader shards only. There are other
shards on the box and they are not throwing any errors. Also there is enough
disk space available on the box at this time.

Solr: 5.3.0.

Any recommendations on how to address this issue??

null:org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:719)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:733)
at 
org.apache.lucene.index.IndexWriter.deleteDocuments(IndexWriter.java:1438)
at 
org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:408)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:80)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processDelete(UpdateRequestProcessor.java:55)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalDelete(DistributedUpdateProcessor.java:960)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteByQuery(DistributedUpdateProcessor.java:1360)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:1154)
at 
org.apache.solr.handler.loader.JavabinLoader.delete(JavabinLoader.java:163)
at 
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:116)
at 
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:499)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.write(Unknown Source)
at sun.nio.ch.FileChannelImpl.write(Unknown Source)
at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
at java.nio.channels.Channels.writeFully(Unknown Source)
at java.nio.channels.Channels.access$000(Unknown Source)
at java.nio.channels.Channels$1.write(Unknown Source)
at 
org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:271)
at java.util.zip.CheckedOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
at 

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr. 
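A minimal sketch of that plan (class names, field names and the Solr URL are
assumptions, not the real application code): run Tika in the client, then push
the extracted text with SolrJ.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolr {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        BodyContentHandler handler = new BodyContentHandler(-1);   // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            new AutoDetectParser().parse(in, handler, metadata);    // Tika runs here, outside Solr
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getFileName().toString());
        doc.addField("content", handler.toString());                // assumed field name
        doc.addField("content_type", metadata.get(Metadata.CONTENT_TYPE));

        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            solr.add(doc);
            solr.commit();
        }
    }
}

This way a parsing failure (like the zip-bomb exception) stays in the client
instead of inside Solr, which is the point of Charlie's and Erick's advice.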


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, April 09, 2018 11:24 AM
To: solr-user@lucene.apache.org
Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

+1



https://urldefense.proofpoint.com/v2/url?u=https-3A__lucidworks.com_2012_02_14_indexing-2Dwith-2Dsolrj_=DwIGaQ=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U=7XZTNWKY6A53HuY_2qeWA_3ndvYmpHBHjZXJ5pTMP2w=YbP_o22QJ_tsZDUPgSfDvEXZ9asBUFFHz53s2yTH8Q0=



We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 





-Original Message-

From: Charlie Hull [mailto:char...@flax.co.uk] 

Sent: Monday, April 9, 2018 12:44 PM

To: solr-user@lucene.apache.org

Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?



I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.



Cheers



Charlie



On 9 April 2018 at 16:59, Hanjan, Harinder 

wrote:



> Hello!

>

> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents 

> we have in our Sharepoint system. I have used the tika-app.jar 

> directly to extract the document in question and it does _not_ throw 

> an exception and extract the contents just fine. So it would seem Solr 

> is doing something different than a Tika standalone installation.

>

> After some Googling, I found out that Solr uses its custom HtmlMapper

> (MostlyPassthroughHtmlMapper) which passes through all elements in the 

> HTML document to Tika. As Tika limits nested elements to 100, this 

> causes Tika to throw an exception: Suspected zip bomb: 100 levels of 

> XML element nesting. This is metioned in TIKA-2091 

> (https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_=DwIGaQ=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U=7XZTNWKY6A53HuY_2qeWA_3ndvYmpHBHjZXJ5pTMP2w=Il6-in8tGiAN3MaNlXmqvIkc3VyCCeG2qK2cGyMOuw0=
>  jira/browse/TIKA-2091?focusedCommentId=15514131=com.atlassian.jira.

> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The 

> "solution" is to use Tika's default parsing/mapping mechanism but no 

> details have been provided on how to configure this at Solr.

>

> I'm hoping some folks here have the knowledge on how to configure Solr 

> to effectively by-pass its built in MostlyPassthroughHtmlMapper and 

> use Tika's implementation.

>

> Thank you!

> Harinder

>

>

> 

> NOTICE -

> This communication is intended ONLY for the use of the person or 

> entity named above and may contain information that is confidential or 

> legally privileged. If you are not the intended recipient named above 

> or a person responsible for delivering messages or communications to 

> the intended recipient, YOU ARE HEREBY NOTIFIED that any use, 

> distribution, or copying of this communication or any of the 

> information contained in it is strictly prohibited. If you have 

> received this communication in error, please notify us immediately by 

> telephone and then destroy or delete this communication, or return it 

> to us by mail if requested by us. The City of Calgary thanks you for your 
> attention and co-operation.

>



replication

2018-04-09 Thread John Blythe
hi, all.

we're starting to dive into master/slave replication architecture. we'll
have 1 master w 4 slaves behind it. our app is NRT. if user performs an
action in section A's data they may choose to jump to section B which will
be dependent on having the updates from their action in section A. as such,
we're thinking that the replication time should be set to 1-2s (the chances
of them arriving at section B quickly enough to catch the 2s gap is highly
unlikely at best).

since the replicas will simply be looking for new files it seems like this
would be a lightweight operation even every couple seconds for 4 replicas.
that said, i'm going *entirely* off of assumption at this point and wanted
to check in w you all to see any nuances, gotchas, hidden landmines, etc.
that we should be considering before rolling things out.

thanks for any info!

--
John Blythe


Backup a solr cloud collection - timeout in 180s?

2018-04-09 Thread Petersen, Robert (Contr)
Shouldn't this just create the backup file(s) asynchronously? Can the timeout 
be adjusted?


Solr 7.2.1 with five nodes and the addrsearch collection is five shards x five 
replicas and "numFound":38837970 docs


Thx

Robi


http://myServer.corp.pvt:8983/solr/admin/collections?action=BACKUP&name=addrsearchBackup&collection=addrsearch&location=/apps/logs/backups


{
  "responseHeader": {
    "status": 500,
    "QTime": 180211
  },
  "error": {
    "metadata": [
      "error-class",
      "org.apache.solr.common.SolrException",
      "root-error-class",
      "org.apache.solr.common.SolrException"
    ],
    "msg": "backup the collection time out:180s",
    ...
  }
}


>From the logs:


2018-04-09 17:47:32.667 INFO  (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/collections
params={name=addrsearchBackup&action=BACKUP&location=/apps/logs/backups&collection=addrsearch}
 status=500 QTime=180211
2018-04-09 17:47:32.667 ERROR (qtp64830413-22) [   ] o.a.s.s.HttpSolrCall 
null:org.apache.solr.common.SolrException: backup the collection time out:180s
at 
org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:314)
at 
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:246)
at 
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:224)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
at 
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:735)
at 
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:716)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:497)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
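On the async question: the same BACKUP call can be submitted with an async id
and then polled, so the client is not bound to the 180s synchronous wait. A
hedged SolrJ sketch (the ZooKeeper ensemble string is an assumption; the
collection, backup name and location are the ones from the request above):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.RequestStatusState;

public class AsyncBackup {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")     // assumed ZK ensemble
                .build()) {
            CollectionAdminRequest.Backup backup =
                    CollectionAdminRequest.backupCollection("addrsearch", "addrsearchBackup")
                            .setLocation("/apps/logs/backups");
            String requestId = backup.processAsync(client);    // returns immediately

            RequestStatusState state;
            do {
                Thread.sleep(5000);
                state = CollectionAdminRequest.requestStatus(requestId)
                        .process(client)
                        .getRequestStatus();
            } while (state == RequestStatusState.SUBMITTED || state == RequestStatusState.RUNNING);
            System.out.println("Backup finished with state: " + state);
        }
    }
}

The 180s is the synchronous wait inside CollectionsHandler.handleResponse
(visible in the stack trace above); the backup itself may still be running on
the overseer, and the async variant simply stops the client from blocking on it.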





This communication is confidential. Frontier only sends and receives email on 
the basis of the terms set out at http://www.frontier.com/email_disclaimer.


How many SynonymGraphFilterFactory can I have?

2018-04-09 Thread Vincenzo D'Amore
Hi all,

in a Solr 4.8 schema I have a fieldType with a few SynonymFilter filters at
index time and a few at query time.

Moving this old schema to Solr 7.3.0, I see that if I use SynonymGraphFilter
during indexing, I have to follow it with FlattenGraphFilter.

I also know that I cannot have multiple SynonymGraphFilters, because each one
produces a graph but cannot consume an incoming graph.

So, should I add a FlattenGraphFilter after each SynonymGraphFilter at index
time to have more than one?

And, again, how can I have many SynonymGraphFilters at query time? :)

Thanks in advance for your time.

Best regards,
Vincenzo

-- 
Vincenzo D'Amore


RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 


-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 9, 2018 12:44 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder 
wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents 
> we have in our Sharepoint system. I have used the tika-app.jar 
> directly to extract the document in question and it does _not_ throw 
> an exception and extract the contents just fine. So it would seem Solr 
> is doing something different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the 
> HTML document to Tika. As Tika limits nested elements to 100, this 
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of 
> XML element nesting. This is metioned in TIKA-2091 
> (https://issues.apache.org/ 
> jira/browse/TIKA-2091?focusedCommentId=15514131=com.atlassian.jira.
> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The 
> "solution" is to use Tika's default parsing/mapping mechanism but no 
> details have been provided on how to configure this at Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr 
> to effectively by-pass its built in MostlyPassthroughHtmlMapper and 
> use Tika's implementation.
>
> Thank you!
> Harinder
>
>
> 
> NOTICE -
> This communication is intended ONLY for the use of the person or 
> entity named above and may contain information that is confidential or 
> legally privileged. If you are not the intended recipient named above 
> or a person responsible for delivering messages or communications to 
> the intended recipient, YOU ARE HEREBY NOTIFIED that any use, 
> distribution, or copying of this communication or any of the 
> information contained in it is strictly prohibited. If you have 
> received this communication in error, please notify us immediately by 
> telephone and then destroy or delete this communication, or return it 
> to us by mail if requested by us. The City of Calgary thanks you for your 
> attention and co-operation.
>


Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to
catch this kind of problem and prevent it bringing down your Solr
installation.

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder 
wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our Sharepoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extract the contents just fine. So it would seem Solr is doing something
> different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is metioned in TIKA-2091 (https://issues.apache.org/
> jira/browse/TIKA-2091?focusedCommentId=15514131=com.atlassian.jira.
> plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131). The
> "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this at Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr to
> effectively by-pass its built in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder
>
>
> 
> NOTICE -
> This communication is intended ONLY for the use of the person or entity
> named above and may contain information that is confidential or legally
> privileged. If you are not the intended recipient named above or a person
> responsible for delivering messages or communications to the intended
> recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying
> of this communication or any of the information contained in it is strictly
> prohibited. If you have received this communication in error, please notify
> us immediately by telephone and then destroy or delete this communication,
> or return it to us by mail if requested by us. The City of Calgary thanks
> you for your attention and co-operation.
>


Uninverting stats on solr 5 and beyond

2018-04-09 Thread Matteo Grolla
Hi,
 on Solr 4 the log contained information about the time spent and memory
consumed uninverting a field.
Where can I find this information in current versions of Solr?

Thanks

--excerpt from solr 4.10 log--

INFO  - 2018-04-09 15:57:58.720; org.apache.solr.request.UnInvertedField;
UnInverted multi-valued field
{field=cat,memSize=4371,tindexSize=51,time=0,phase1=0,nTerms=2,bigTerms=0,termInstances=4,uses=0}


How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Hello!

Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have 
in our Sharepoint system. I have used the tika-app.jar directly to extract the 
document in question and it does _not_ throw an exception; it extracts the 
contents just fine. So it would seem Solr is doing something different than a 
standalone Tika installation.

After some Googling, I found out that Solr uses its custom HtmlMapper 
(MostlyPassthroughHtmlMapper) which passes through all elements in the HTML 
document to Tika. As Tika limits nested elements to 100, this causes Tika to 
throw an exception: Suspected zip bomb: 100 levels of XML element nesting. This 
is mentioned in TIKA-2091 
(https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
 The "solution" is to use Tika's default parsing/mapping mechanism, but no 
details have been provided on how to configure this in Solr.

I'm hoping some folks here know how to configure Solr to effectively bypass its 
built-in MostlyPassthroughHtmlMapper and use Tika's implementation.

Thank you!
Harinder



NOTICE -
This communication is intended ONLY for the use of the person or entity named 
above and may contain information that is confidential or legally privileged. 
If you are not the intended recipient named above or a person responsible for 
delivering messages or communications to the intended recipient, YOU ARE HEREBY 
NOTIFIED that any use, distribution, or copying of this communication or any of 
the information contained in it is strictly prohibited. If you have received 
this communication in error, please notify us immediately by telephone and then 
destroy or delete this communication, or return it to us by mail if requested 
by us. The City of Calgary thanks you for your attention and co-operation.


Query regarding LTR plugin in solr

2018-04-09 Thread Prateek Agarwal
Hi,

I'm working on the LTR feature in Solr. I have a feature like:
''' {
"store" : "my_feature_store",
"name" : "in_aggregated_terms",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!func}scale(query({!payload_score
f=aggregated_terms func=max v=${query}}),0,100)" }
  } '''

Here the scaling function is taking a lot more time than expected.
Is there a way I could implement a customized class, or any other way to
reduce this time?

So basically I just want to scale the value by looking at the whole result set
instead of just the current document. Can I have/implement something during
normalization?


Thanks in advance


Regards,
Prateek


SOLR with Sitecore SXA

2018-04-09 Thread Saul Nachman
Do I ask for a subscription here first and then mail the main thread?


Regards

Saul


Re: Default Index config

2018-04-09 Thread Shawn Heisey

On 4/9/2018 4:04 AM, mganeshs wrote:

Regarding the high CPU: while troubleshooting, we found that merge threads keep
on running and take most of the CPU time (as per VisualVM).


With a one second autoSoftCommit, nearly constant indexing will produce 
a lot of very small index segments.  Those index segments will have to 
be merged eventually.  You have increased the merge policy numbers which 
will reduce the total number of merges, but each merge is going to be 
larger than it would with defaults, so it's going to take a little bit 
longer.  This isn't too big a deal with first-level merges, but at the 
higher levels, they do get large -- no matter what the configuration is.



*Note*: the following is the code snippet we use for indexing / adding Solr
documents in batches, per collection:

for (SolrCollectionList solrCollection : SolrCollectionList.values()) {
    CollectionBucket collectionBucket = getCollectionBucket(solrCollection);
    List<SolrInputDocument> solrInputDocuments =
        collectionBucket.getSolrInputDocumentList();
    String collectionName = collectionBucket.getCollectionName();
    try {
        if (solrInputDocuments.size() > 0) {
            CloudSolrClient solrClient =
                PlatformIndexManager.getInstance().getCloudSolrClient(collectionName);
            solrClient.add(collectionName, solrInputDocuments);
        }
    }

where solrClient is created as below:

this.cloudSolrClient = new CloudSolrClient.Builder()
    .withZkHost(zooKeeperHost)
    .withHttpClient(HttpClientUtil.HttpClientFactory.createHttpClient())
    .build();
this.cloudSolrClient.setZkClientTimeout(3);


Is that code running on the Solr server, or on a different machine?  Are 
you creating a SolrClient each time you use it, or have you created 
client objects that get re-used?


You don't need a different SolrClient object for each collection.  Your 
"getCloudSolrClient" method takes a collection name, which suggests 
there might be a different client object for each one.  Most of the 
time, you need precisely one client object for the entire application.
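A minimal sketch of that arrangement (an assumption for illustration, not the
poster's actual code): one CloudSolrClient shared by the whole application,
with the collection chosen per request.

import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SharedSolrClientExample {
    private final CloudSolrClient cloudSolrClient;

    public SharedSolrClientExample(String zooKeeperHost) {
        // Built once for the entire application, not once per collection.
        this.cloudSolrClient = new CloudSolrClient.Builder()
                .withZkHost(zooKeeperHost)
                .build();
        this.cloudSolrClient.setZkClientTimeout(30000);   // example value
    }

    public void indexBatch(String collectionName, List<SolrInputDocument> docs) throws Exception {
        if (!docs.isEmpty()) {
            cloudSolrClient.add(collectionName, docs);    // same client, collection picked per call
        }
    }

    public void close() throws Exception {
        cloudSolrClient.close();
    }
}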



Hard commit is kept automatic and set to 15000 ms.
In this process we also see that when a merge is happening and the default
maxMergeCount is already reached, commits get delayed and the SolrJ client
(where we add documents) is blocked; once one of the merge threads processes
the merge, the SolrJ client returns the result.
How do we avoid this blocking of the SolrJ client? Do I need to move away from
the default config for this scenario? I mean, change the merge factor
configuration?

Can you suggest what the merge config should be for such a scenario? Based on
forums, I tried to change the merge settings to the following,


What are you trying to accomplish by changing the merge policy?  It's 
fine to find information for a config on the Internet, but you need to 
know what that config *does* before you use it, and make sure it aligns 
with your goals.  On mine, I change maxMergeAtOnce and segmentsPerTier 
to 35, and maxMergeAtOnceExplicit to 105.  I know exactly what I'm 
trying to do with this config -- reduce the frequency of merges.  Each 
merge is going to be larger with this config, but they will happen less 
frequently.  These three settings are the only ones that I change in my 
merge policy.  Changing all of the other settings that you have changed 
should not be necessary.  I make one other adjustment in this area -- to 
the merge scheduler.



On the same Solr node we have multiple indexes / collections. In that case,
will TieredMergePolicyFactory be the right option, or should we go for another
merge policy (like LogByteSize, etc.) when there are multiple collections on
the same node?


TieredMergePolicy was made the default policy after a great deal of 
testing and discussion by Lucene developers.  They found that it works 
better than the others for the vast majority of users.  It is likely the 
best choice for you too.


These are the settings that I use in indexConfig to reduce the impact of 
merges on my indexing:


  
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>

Note that this config is designed for 6.x and earlier.  I do not know if 
it will work in 7.x.  It probably needs to be adjusted to the new 
Factory config.  You can use it as a guide, though.


Thanks,
Shawn



RE: PreAnalyzed URP and SchemaRequest API

2018-04-09 Thread Markus Jelsma
Hello David,

The remote client has everything on the classpath, but just calling
setTokenStream is not going to work. Remotely, all I get from the SchemaRequest
API is an AnalyzerDefinition. I haven't found any Solr code that allows me to
transform that directly into an analyzer. If I had that, it would make things
easy.

As far as I see it, I need to reconstruct a real Analyzer using the
AnalyzerDefinition's information. It won't be a problem, but it is cumbersome.

Thanks anyway,
Markus
 
-Original message-
> From:David Smiley 
> Sent: Thursday 5th April 2018 19:38
> To: solr-user@lucene.apache.org
> Subject: Re: PreAnalyzed URP and SchemaRequest API
> 
> Is this really a problem when you could easily enough create a TextField
> and call setTokenStream?
> 
> Does your remote client have Solr-core and all its dependencies on the
> classpath?   That's one way to do it... and presumably the direction you
> are going because you're asking how to work with PreAnalyzedParser which is
> in solr-core.  *Alternatively*, only bring in Lucene core and construct
> things yourself in the right format.  You could copy PreAnalyzedParser into
> your codebase so that you don't have to reinvent any wheels, even though
> that's awkward.  Perhaps that ought to be in Solrj?  But no we don't want
> SolrJ depending on Lucene-core, though it'd make a fine "optional"
> dependency.
> 
> On Wed, Apr 4, 2018 at 4:53 AM Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > We intend to move to PreAnalyzed URP for analysis offloading. Browsing the
> > Javadocs i came across the SchemaRequest API looking for a way to get a
> > Field object remotely, which i seem to need for
> > JsonPreAnalyzedParser.toFormattedString(Field f). But all i can get from
> > SchemaRequest API is FieldTypeRepresentation, which offers me
> > getIndexAnalyzer() but won't allow me to construct a Field object.
> >
> > So, to analyze remotely i do need an index-time analyzer. I can get it,
> > but not turn it into a Field object, which the PreAnalyzedParser for some
> > reason wants.
> >
> > Any hints here? I must be looking the wrong way.
> >
> > Many thanks!
> > Markus
> >
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
> 


Re: Solr join With must clause in fq

2018-04-09 Thread Mikhail Khludnev
It might make sense to test this on a recent version of Solr.
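
A minimal SolrJ harness to run the two filter forms from the quoted message 
side by side; the class name and Solr URL are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class JoinFqCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build()) {
            String[] filters = {
                "+{!join from=m_id to=m_id fromIndex=core2 force=true}team:miami",
                "{!join from=m_id to=m_id fromIndex=core2 force=true}team:miami"
            };
            for (String fq : filters) {
                SolrQuery query = new SolrQuery("*:*");
                query.addFilterQuery(fq);
                long found = solr.query(query).getResults().getNumFound();
                // Print how many docs each filter form matches.
                System.out.println(fq + " -> " + found + " docs");
            }
        }
    }
}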

On Sun, Apr 8, 2018 at 8:21 PM, manuj singh  wrote:

> Hi all,
> I am trying to debug a problem which i am facing and need some help.
>
> I have a solr query which does a join on 2 different cores. So let's say my
> first core has the following 3 docs:
>
> { "id":"1", "m_id":"lebron", "some_info":"29" }
>
> { "id":"2", "m_id":"Wade", "matches_win":"29" }
>
> { "id":"3", "m_id":"lebron", "some_info":"1234" }
>
> my second core has the following docs
>
> { "m_id": "lebron", "team": "miami" }
>
> { "m_id": "Wade", "team": "miami" }
>
> so now we made an update to doc with lebron and changed the team to
> "clevelend". So the new docs in core 2 looks like this.
>
> { "m_id": "lebron", "team": "clevelend" }
>
> { "m_id": "Wade", "team": "miami" }
>
> now i am trying to join these 2 and finding the docs from core1 for team
> miami.
>
> my query looks like this
>
> fq=+{!join from=m_id to=m_id fromIndex=core2 force=true}team:miami
>
> I am expecting it to return the doc with id=2, but what i am getting is
> documents 1 and 2.
>
> I am not able to figure out what the problem is. Is the query incorrect,
> or is there some issue in the join?
>
> *Couple of observations.*
>
> 1.if i remove the + from the filter query it works as expected. so the
> following query works
>
> fq={!join from=m_id to=m_id fromIndex=core2 force=true}team:miami
>
> I am not sure how the Must clause is affecting the query.
>
> *2.* Also, if you look, the original query is not returning document
> 3 (however it's returning document 1, which has the same m_id). Now the only
> difference between doc 1 and doc3 is that doc1 was created when "lebron"
> was part of team: miami. and doc3 was created when the team got updated to
> "cleveland". So the join is working fine for the new docs in core1 but not
> for the old docs.
>
> 3.If i use q instead of fq the query returns results as expected.
>
> q=+{!join from=m_id to=m_id fromIndex=core2 force=true}team:miami
>
> and
>
> q={!join from=m_id to=m_id fromIndex=core2 force=true}team:miami
>
> Both of the above works.
>
> I am sure i am missing something about how the join works internally. I am trying to
> understand why fq has a different behavior than q with the Must(+) clause.
>
> I am using solr 4.10.
>
>
>
> Thanks
>
> Manuj
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Match a phrase like "Apple iPhone 6 32GB white" with "iphone 6"

2018-04-09 Thread Alessandro Benedetti
Hi Sami,
I agree with Mikhail: if you have relatively complex data you could curate
your own knowledge base for products and use it for Named Entity Recognition.
You can then search a compatible_with field using the extracted entity.

If the scenario is simpler, using the analysis chain you mentioned should
work (if the product names are always complete and well curated).
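
If it helps, here is a quick way to check the shingle idea from the message 
quoted below with plain Lucene; the exact chain in that message was stripped by 
the archive, so the whitespace tokenizer + lowercase + 2-gram shingle chain used 
here is an assumption:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleCheck {
    public static void main(String[] args) throws Exception {
        // Query-side chain: whitespace tokenizer -> lowercase -> 2-word shingles (unigrams kept).
        CustomAnalyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("whitespace")
                .addTokenFilter("lowercase")
                .addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2")
                .build();

        try (TokenStream ts = analyzer.tokenStream("name", "Apple iPhone 6 32GB white")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Prints "apple", "apple iphone", "iphone", "iphone 6", "6", "6 32gb", ...
                System.out.println(term);
            }
            ts.end();
        }
    }
}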

Cheers





--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Mon, Apr 9, 2018 at 10:40 AM, Adhyan Arizki  wrote:

> You can just use synonyms for that.. rather hackish but it works
>
> On Mon, 9 Apr 2018, 05:06 Sami al Subhi,  wrote:
>
> > I think this filter will output the desired result:
> >
> > (fieldType / analyzer XML stripped by the mailing list archive)
> >
> > indexing:
> > "iPhone 6" will be indexed as "iphone 6" (always a single token)
> >
> > querying:
> > so this will analyze "Apple iPhone 6 32GB white" to "apple", "apple
> > iphone",
> > "iphone", "iphone 6" and so on...
> > then here a match will be achieved using the 4th token.
> >
> >
> > I don't see how this will result in false positive matching.
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >
>


Re: Default Index config

2018-04-09 Thread mganeshs
Hi Shawn,

Regarding the high CPU: when we were troubleshooting, we found that merge threads
keep on running and take most of the CPU time ( as per Visual JVM ). GC is
not causing any issue, as we use the default GC and also tried with G1 as you
suggested over here.

Though it's only a background process, we suspect it is what causes the CPU
to go high.

Since we are using Solr for real-time indexing of data and depend on its
results immediately to show in the UI as well, we keep adding around 100
to 200 documents per second in parallel, in batches of 20 Solr documents
per add call...

Note: the following is the code snippet we use for indexing / adding Solr
documents in batches, per collection:

for (SolrCollectionList solrCollection : SolrCollectionList.values()) {
    CollectionBucket collectionBucket = getCollectionBucket(solrCollection);
    List solrInputDocuments = collectionBucket.getSolrInputDocumentList();
    String collectionName = collectionBucket.getCollectionName();
    try {
        if (solrInputDocuments.size() > 0) {
            CloudSolrClient solrClient =
                PlatformIndexManager.getInstance().getCloudSolrClient(collectionName);
            solrClient.add(collectionName, solrInputDocuments);
        }
    }

where solrClient is created as below:

this.cloudSolrClient = new CloudSolrClient.Builder()
    .withZkHost(zooKeeperHost)
    .withHttpClient(HttpClientUtil.HttpClientFactory.createHttpClient())
    .build();
this.cloudSolrClient.setZkClientTimeout(3);

Hard commit is kept as automatic and set to 15000 ms.
In this process, we also see, when merge is happening, and already
maxMergeCount ( default one ) is reached, commits are getting delayed and
solrj client ( where we add documents ) is getting blocked, and once one of the
merge threads processes the merge, then the solrj client returns the result.
How do we avoid this blocking of solrj client ? Do I need to go out of
default config for this scenario? I mean change the merge factor
configuration ? 

Can you suggest what the merge config would be for such a scenario? Based on
forums, I tried to change the merge settings to the following:


(merge policy settings with the element names stripped by the mailing list
archive; the values were 30, 30, 30, 2048, 512, 0.1, 2048, 2.0 and 10.0)


But I couldn't see much change in the behaviour.

On the same solr node we have multiple indexes / collections. In that case,
is TieredMergePolicyFactory still the right option, or should we go for another
merge policy ( like LogByte etc. ) when multiple collections share the same
node?


Can you throw some light on these aspects?
Regards,

 Regarding auto commit, we discussed a lot with our product owners and at last
> we are forced to keep it at 1 sec and we couldn't increase it further. Even as
> it is, sometimes our customers say that they have to refresh their pages
> a couple of times to get the update from solr. So we can't increase it
> further.

I understand pressure from nontechnical departments for very low 
response times. Executives, sales, and marketing are usually the ones 
making those kinds of demands. I think you should push back on that 
particular requirement on technical grounds.

A soft commit interval that low *can* contribute to performance issues.  
It doesn't always cause them, I'm just saying that it *can*.  Maybe 
increasing it to five or ten seconds could help performance, or maybe it 
will make no real difference at all.

> Yes. As of now only solr is running on that machine. Initially we were
> running it along with hbase region servers and it was working fine, but due to
> CPU spikes and OS disk cache we were forced to move solr to a separate machine.
> But I just checked: our solr data folder size comes to only 17GB. Two
> collections have around 5GB each and the others have 2 to 3 GB. If you say
> that only 2/3 of the total size goes to the OS disk cache, then the VIRT value
> in the top command, which is always 28G, is more than what we have. Why is
> that...
> Pls check the top command & GC we used in this doc
> https://docs.google.com/document/d/1SaKPbGAKEPP8bSbdvfX52gaLsYWnQfDqfmV802hWIiQ/edit?usp=sharing;

The VIRT memory should be about equivalent to the RES size plus the size 
of all the index data on the system.  With about 17GB of index data, the 28G 
VIRT figure leaves roughly 11GB for the resident size of the Java process, so 
that looks about right.  The actual amount of memory allocated by Java for 
the heap and other memory structures is approximately equal to RES minus SHR.

I am not sure whether the SHR size gets counted in VIRT. It probably 
does.  On some Linux systems, SHR grows to a very high number, but when 
that happens, it typically doesn't reflect actual memory usage.  I do 
not know why this sometimes happens.  That is a question for Oracle, since 
they are the current owners of Java.

Only 5GB is in the buff/cache area.  The system has 13GB of free 
memory.  That system is NOT low on memory.

With 4 CPUs, a load average in the 3-4 range is an indication that the 

Re: Match a phrase like "Apple iPhone 6 32GB white" with "iphone 6"

2018-04-09 Thread Adhyan Arizki
You can just use synonyms for that.. rather hackish but it works

On Mon, 9 Apr 2018, 05:06 Sami al Subhi,  wrote:

> I think this filter will output the desired result:
>
> (fieldType / analyzer XML stripped by the mailing list archive)
>
> indexing:
> "iPhone 6" will be indexed as "iphone 6" (always a single token)
>
> querying:
> so this will analyze "Apple iPhone 6 32GB white" to "apple", "apple
> iphone",
> "iphone", "iphone 6" and so on...
> then here a match will be achieved using the 4th token.
>
>
> I don't see how this will result in false positive matching.
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


RE: ZKPropertiesWriter error DIH (SolrCloud 6.6.1)

2018-04-09 Thread msaunier
I am bumping my question. Thanks





-Original Message-
From: msaunier [mailto:msaun...@citya.com] 
Sent: Thursday, April 5, 2018 10:46
To: solr-user@lucene.apache.org
Subject: RE: ZKPropertiesWriter error DIH (SolrCloud 6.6.1)

I have used this process to create the DIH:

1. Create the BLOB collection:
* curl
http://localhost:8983/solr/admin/collections?action=CREATE&name=.system

2. Send definition and file for DIH
* curl -X POST -H 'Content-Type: application/octet-stream' --data-binary
@solr-dataimporthandler-6.6.1.jar
http://localhost:8983/solr/.system/blob/DataImportHandler
* curl -X POST -H 'Content-Type: application/octet-stream' --data-binary
@mysql-connector-java-5.1.46.jar
http://localhost:8983/solr/.system/blob/MySQLConnector
* curl http://localhost:8983/solr/advertisements2/config -H
'Content-type:application/json' -d '{"add-runtimelib": {
"name":"DataImportHandler", "version":1 }}'
* curl http://localhost:8983/solr/advertisements2/config -H
'Content-type:application/json' -d '{"add-runtimelib": {
"name":"MySQLConnector", "version":1 }}'

3. I have added the requestHandler to the config file with the Config API. Result:
###
  "/full-advertisements": {
"runtimeLib": true,
"version": 1,
"class": "org.apache.solr.handler.dataimport.DataImportHandler",
"defaults": {
  "config": "DIH/advertisements.xml"
},
"name": "/full-advertisements"
  },
###

4. I have added the .xml definition file with the zkcli.sh script at
/configs/advertisements2/DIH/advertisements.xml
###
(dataConfig XML stripped by the mailing list archive)
###

Thanks for your help.


-Original Message-
From: msaunier [mailto:msaun...@citya.com] 
Sent: Wednesday, April 4, 2018 09:57
To: solr-user@lucene.apache.org
Cc: fharr...@citya.com
Subject: ZKPropertiesWriter error DIH (SolrCloud 6.6.1)

Hello,
I use SolrCloud and I am testing the DIH system in cloud mode, but I get this error:

Full Import
failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to PropertyWriter implementation:ZKPropertiesWriter at
org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImp
orter.java:330)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.ja
va:411)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474
)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImport
er.java:457)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException at
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:935)
at
org.apache.solr.handler.dataimport.DataImporter.createPropertyWriter(DataImp
orter.java:326)
... 4 more

My DIH definition on the cloud:

(dataConfig XML stripped by the mailing list archive)
Call response:

http://localhost:8983/solr/advertisements2/full-advertisements?command=full-
import=false=true

(response XML stripped by the archive; recoverable values: status 0, QTime 2,
config DIH/advertisements.xml, command "full-import", status "idle")

I don't understand why I have this error. Can you help me?
Thank you.

 





Re: Solr 7.3.0 loading OpenNLPExtractNamedEntitiesUpdateProcessorFactory

2018-04-09 Thread Ryan Yacyshyn
Hi Shawn,

I'm pretty sure the paths to load the jars in analysis-extras are correct;
the jars in /contrib/analysis-extras/lib load fine. I verified this by
changing the name of solr.OpenNLPTokenizerFactory to
solr.OpenNLPTokenizerFactory2
and saw the new error. Changing it back to solr.OpenNLPTokenizerFactory
(without the "2") doesn't throw any errors, so I'm assuming these two
jar files (opennlp-maxent-3.0.3.jar and opennlp-tools-1.8.3.jar) must be
loading.

I tried swapping the order in which these jars are loaded as well, but no
luck there.

I have attached my solr.log file after a restart. Also included is my
solrconfig.xml and managed-schema. The path to my config
is /Users/ryan/solr-7.3.0/server/solr/nlp/conf and this is where I have the
OpenNLP bin files (en-ner-person.bin, en-sent.bin, and en-token.bin).
Configs are derived from the _default configset.

On a mac, and my Java version is:

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Thanks,
Ryan



On Sun, 8 Apr 2018 at 21:34 Shawn Heisey  wrote:

> On 4/8/2018 2:36 AM, Ryan Yacyshyn wrote:
> > I'm running into a small problem loading
> > the OpenNLPExtractNamedEntitiesUpdateProcessorFactory class, getting an
> > error saying it's not found. I'm loading all the required jar files,
> > according to the readme:
>
> You've got a <lib> element to load analysis-extras jars, but are you
> certain it's actually loading anything?
>
> Can you share a solr.log file created just after a Solr restart?  Not
> just a reload -- I'm asking for a restart so the log is more complete.
> With that, I can see what's happening and then ask more questions that
> may pinpoint something.
>
> Thanks,
> Shawn
>
>





  

  
(attachment: solrconfig.xml derived from the _default configset, luceneMatchVersion 7.3.0;
the XML markup was stripped by the mailing list archive, so the remaining fragments are not shown)