Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Hasan Diwan
Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/
may be helpful? Though I see the date on it and am now unsure. -- H

On Mon, 31 Dec 2018 at 17:51, Zheng Lin Edwin Yeo 
wrote:

> Hi Alex,
>
> I have tried with a file that is HTML formatted, with those tags like
> , , , etc, and those gets removed during indexing.
>
> For tags like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that in the
> EML file, there are two different content type, text/html and text/plain.
> Could it be due to Tika getting the content type from text/html instead of
> text/plain?
>
> Regards,
> Edwin
>
> On Mon, 31 Dec 2018 at 23:52, Alexandre Rafalovitch 
> wrote:
>
> > EML is for emails, so there are probably some HTML-formatted emails
> > that you are getting. Probably with the alternative text-part. Outlook
> > would render HTML and/or use text part. I think you can just open EML
> > in an editor to check it out.
> >
> > As to URP, are you absolutely sure it is being used? It is not
> > declared as default, so you need to call it explicitly. Try setting a
> > field in there or some other clear flag that a record has been
> > processed.
> >
> > Regards,
> > Alex.
> >
> > On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo 
> > wrote:
> > >
> > > > These texts are likely from the original EML file data, but they are not
> > > > visible in the content when the EML file is opened in Microsoft Outlook.
> > > >
> > > > I have already applied the HTMLStripFieldUpdateProcessorFactory in
> > > > solrconfig.xml, but these texts are still showing up in the index. Below is
> > > > my configuration.
> > > >
> > > > <updateRequestProcessorChain name="...">
> > > >   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
> > > >     <str name="fieldName">content_tcs</str>
> > > >   </processor>
> > > >   <processor class="solr.LogUpdateProcessorFactory" />
> > > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > > </updateRequestProcessorChain>
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > > > On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafa...@gmail.com>
> > > > wrote:
> > > >
> > > > > Specifically, a custom Update Request Processor chain can be used before
> > > > > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > > > Regards,
> > > >  Alex
> > > >
> > > > > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I think this kind of text manipulation should be done before indexing. If
> > > > > > you have font-size and font-family in your text, you are very likely
> > > > > > indexing HTML with CSS.
> > > > > > If I'm right, you are entering a hell of words that should be
> > > > > > removed from your text.
> > > > > >
> > > > > > On the other hand, if you have to do this at index time, a quick and
> > > > > > dirty solution is to use the pattern-replace filter:
> > > > > >
> > > > > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > > > >
> > > > > Ciao,
> > > > > Vincenzo
> > > > >
> > > > > --
> > > > > mobile: 3498513251
> > > > > skype: free.dev
> > > > >
> > > > > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I noticed that during the indexing of EML files, words like
> > > > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > > > > content as well.
> > > > > > >
> > > > > > > I would like to check how we can remove those words during indexing.
> > > > > >
> > > > > > I am using Solr 7.5.0
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
> > > > >
> > > >
> >
>




Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Zheng Lin Edwin Yeo
Hi Alex,

I have tried with a file that is HTML formatted, with tags like
, , , etc., and those get removed during indexing.

For text like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that in the
EML file there are two different content types, text/html and text/plain.
Could it be due to Tika getting the content from the text/html part instead
of the text/plain part?
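
One way to check this outside Solr is to list the MIME parts of the EML file
and look at the text/plain body on its own. The following is only a sketch
using Python's standard email package; the function names are made up for
illustration, and it says nothing about which part Tika actually picks.

```python
from email import message_from_string, policy


def list_part_types(eml_text):
    """Return the content type of every non-container MIME part, in order."""
    msg = message_from_string(eml_text, policy=policy.default)
    return [part.get_content_type()
            for part in msg.walk()
            if not part.is_multipart()]


def plain_text_body(eml_text):
    """Return the text/plain body, or None if the message has no such part."""
    msg = message_from_string(eml_text, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else None
```

If list_part_types() reports both text/plain and text/html, the CSS
declarations most likely come from style attributes in the HTML alternative,
and the text/plain body should be free of them.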

Regards,
Edwin

On Mon, 31 Dec 2018 at 23:52, Alexandre Rafalovitch 
wrote:

> EML is for emails, so there are probably some HTML-formatted emails
> that you are getting. Probably with the alternative text-part. Outlook
> would render HTML and/or use text part. I think you can just open EML
> in an editor to check it out.
>
> As to URP, are you absolutely sure it is being used? It is not
> declared as default, so you need to call it explicitly. Try setting a
> field in there or some other clear flag that a record has been
> processed.
>
> Regards,
> Alex.
>
> On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo 
> wrote:
> >
> > These texts are likely from the original EML file data, but they are not
> > visible in the content when the EML file is opened in Microsoft Outlook.
> >
> > I have already applied the HTMLStripFieldUpdateProcessorFactory in
> > solrconfig.xml, but these texts are still showing up in the index. Below is
> > my configuration.
> >
> > <updateRequestProcessorChain name="...">
> >   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
> >     <str name="fieldName">content_tcs</str>
> >   </processor>
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> >
> > Regards,
> > Edwin
> >
> > On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch 
> > wrote:
> >
> > > Specifically, a custom Update Request Processor chain can be used before
> > > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > > Regards,
> > >  Alex
> > >
> > > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore wrote:
> > >
> > > > Hi,
> > > >
> > > > I think this kind of text manipulation should be done before indexing. If
> > > > you have font-size and font-family in your text, you are very likely
> > > > indexing HTML with CSS.
> > > > If I'm right, you are entering a hell of words that should be
> > > > removed from your text.
> > > >
> > > > On the other hand, if you have to do this at index time, a quick and
> > > > dirty solution is to use the pattern-replace filter:
> > > >
> > > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > > >
> > > > Ciao,
> > > > Vincenzo
> > > >
> > > > --
> > > > mobile: 3498513251
> > > > skype: free.dev
> > > >
> > > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I noticed that during the indexing of EML files, words like
> > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > > content as well.
> > > > >
> > > > > I would like to check how we can remove those words during indexing.
> > > > >
> > > > > I am using Solr 7.5.0
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > >
> > >
>


Re: How to access the Solr Admin GUI

2018-12-31 Thread Jörn Franke
Reverse proxy?

> On 31.12.2018 at 22:48, s...@cid.is wrote:
> 
> Hi all,
> 
> is there a way, better a solution, to access the Solr Admin GUI from outside 
> the server (via public web) while the Solr port 8983 is closed by a firewall 
> and only available inside the server via localhost?
> 
> Thanks in advance
> Walter Claassen
> 
> Alexandraweg 32
> D 64287 Darmstadt
> Fon +49-6151-4937961
> Fax +49-6151-4937969
> c...@cid.is
> 


How to access the Solr Admin GUI

2018-12-31 Thread solr

Hi all,

Is there a way, or better yet an established solution, to access the Solr
Admin GUI from outside the server (via the public web) while Solr port 8983
is closed by a firewall and only reachable from inside the server via localhost?


Thanks in advance
Walter Claassen

Alexandraweg 32
D 64287 Darmstadt
Fon +49-6151-4937961
Fax +49-6151-4937969
c...@cid.is



Re: SOLR Cloud - Full index replication

2018-12-31 Thread Erick Erickson
No particular downside to increasing numRecordsToKeep except
there is some additional disk space required and a bit of
bookkeeping.

Frankly, though, that's a bandaid at best. There should be more
information in the logs about _why_ they go into recovery.

If you're indexing while nodes are down, that would certainly
explain it. But if nodes are going into recovery when everything
is up and running, there should be _some_ messages in the
logs as to why.
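
As a starting point for digging through the logs, something like the
following sketch can pull out the candidate lines. The first pattern is the
message Doss quoted; the second is just an assumed loose keyword match, so
expect false positives. The function name and the log file name in the
comment are made up.

```python
import re

# Patterns worth flagging: the first is the message quoted in this thread,
# the second is a loose, assumed keyword match on recovery-related lines.
PATTERNS = [
    re.compile(r"No registered leader was found"),
    re.compile(r"recover", re.IGNORECASE),
]


def suspicious_lines(lines):
    """Yield (line_number, line) for log lines matching any pattern."""
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in PATTERNS):
            yield number, line.rstrip("\n")


# Example use against a log file (hypothetical file name):
#     with open("solr.log", encoding="utf-8") as fh:
#         for number, line in suspicious_lines(fh):
#             print(number, line)
```

Run it against the log of the replica that went into recovery and against the
log of the corresponding leader, then compare the timestamps around the hits.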

Best,
Erick

On Sun, Dec 30, 2018 at 9:42 PM Doss  wrote:
>
> Thanks Erick!
>
> We are using SOLR version 7.0.1.
>
> Are there any disadvantages if we increase the peer sync size to 1000?
>
> We have analysed the GC logs but have not seen long GC pauses so far.
>
> We tried to find the reason for the full sync but found nothing more
> informative. We have, however, seen many log entries that read "No registered
> leader was found after waiting for 4000ms", followed by this full index
> replication.
>
> Thanks,
> Doss.
>
>
> On Sun, Dec 30, 2018 at 8:49 AM Erick Erickson 
> wrote:
>
> > No. There's a "peer sync" that will try to update from the leader's
> > transaction log if (and only if) the replica has fallen behind. By
> > "fallen behind" I mean it was unable to accept any updates for
> > some period of time. The default peer sync size is 100 docs,
> > you can make it larger see numRecordsToKeep here:
> > http://lucene.apache.org/solr/guide/7_6/updatehandlers-in-solrconfig.html
> >
> > Some observations though:
> > 12G heap for 250G of index on disk _may_ work, but I'd be looking at
> > the GC characteristics, particularly stop-the-world pauses.
> >
> > Your hard commit interval looks too long. I'd shorten it to < 1 minute
> > with openSearcher=false. See:
> >
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > I'd concentrate on _why_ the replica goes into recovery in the first
> > place. You say you're on 7x, which one? Starting in 7.3 the recovery
> > logic was pretty thoroughly reworked, so _which_ 7x version is
> > important to know.
> >
> > The Solr logs should give you some idea of _why_ the replica
> > goes into recovery, concentrate on the replica that goes into
> > recovery and the corresponding leader's log.
> >
> > Best,
> > Erick
> >
> > On Sat, Dec 29, 2018 at 6:23 PM Doss  wrote:
> > >
> > > We are using a 3-node Solr cloud setup (64GB RAM / 8 CPUs / 12GB heap) with
> > > version 7.x. We have 3 indexes/collections on each node; the index size is
> > > about 250GB. NRT with 5 sec soft / 10 min hard commit. Sometimes on one of
> > > the nodes we see a full index replication start running. Is there any
> > > configuration that forces Solr to replicate the full index, such as a
> > > difference of 100/200 updates between a node and the leader? - Thanks.
> >


Re: Facing issue while transforming and indexing custom JSON

2018-12-31 Thread Alexandre Rafalovitch
Do you have _src_ field declared in schema? It is just a non-indexed string:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L169
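
If the field turns out to be missing and the core uses a managed schema, one
option is to declare it through the Schema API. A minimal sketch follows; the
helper name, the base URL and the core name "mycore" are placeholders, and the
field attributes are assumed to mirror the sample schema linked above. The
request is only built here, not sent.

```python
import json
import urllib.request


def add_src_field_request(base_url, core):
    """Build (but do not send) a Schema API request that declares _src_."""
    payload = {
        "add-field": {
            "name": "_src_",
            "type": "string",
            "indexed": False,  # not searchable, as in the linked sample schema
            "stored": True,    # keeps the original JSON retrievable
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/solr/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core name; pass req to urllib.request.urlopen() on a live Solr.
req = add_src_field_request("http://localhost:8983", "mycore")
```

With a hand-edited schema.xml the Schema API is not available for writes, and
the equivalent field line has to be added to the file instead.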

Regards,
   Alex.

On Mon, 31 Dec 2018 at 04:35, Shubhangi Shinde
 wrote:
>
> Hi Team,
>
> I am waiting for your feedback. Any update on this issue?
>
> On Fri, Dec 28, 2018 at 12:11 PM Shubhangi Shinde <
> shubhangi.shi...@iauro.com> wrote:
>
> > Hi Team,
> >
> > I am using Apache Solr. I followed the section at
> > https://lucene.apache.org/solr/guide/7_5/transforming-and-indexing-custom-json.html#setting-json-defaults
> > on transforming and indexing custom JSON and added the code to one of my
> > cores in order to upload a multilevel JSON document. It throws the error
> > below. I have spent a lot of time trying to solve it, but with no luck. The
> > error is:
> >
> > {
> >   "responseHeader":{
> >     "status":400,
> >     "QTime":93},
> >   "error":{
> >     "metadata":[
> >       "error-class","org.apache.solr.common.SolrException",
> >       "root-error-class","org.apache.solr.common.SolrException"],
> >     "msg":"ERROR: [doc=5b62d25] unknown field '_src_'",
> >     "code":400}}
> >
> > I added the below code in my solrconfig.xml file. Please check,
> >
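> > <initParams path="/update/json/docs">
> >   <lst name="defaults">
> >     <str name="srcField">_src_</str>
> >     <str name="mapUniqueKeyOnly">true</str>
> >     <str name="df">text</str>
> >   </lst>
> > </initParams>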
> >
> >
> > Please check the issue on stack overflow on the below link,
> > https://stackoverflow.com/questions/53775064/apache-solr-error-unknown-field-src
> >
> > Please let me know if you want some more information about this. Thanks in
> > advance.
> >
> > --
> >
> > *Shubhangi Shinde*
> > Sr. Software Engineer, iauro Systems Pvt. Ltd.
> > 020-64008585 | shubhangi.shi...@iauro.com | www.iauro.com
> > 
> > 
> >   
> >
>
>
> --
>
> *Shubhangi Shinde*
> Sr. Software Engineer, iauro Systems Pvt. Ltd.
> 020-64008585 | shubhangi.shi...@iauro.com | www.iauro.com
> 
> 
>   


Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content

2018-12-31 Thread Alexandre Rafalovitch
EML is for emails, so there are probably some HTML-formatted emails
that you are getting. Probably with the alternative text-part. Outlook
would render HTML and/or use text part. I think you can just open EML
in an editor to check it out.

As to URP, are you absolutely sure it is being used? It is not
declared as default, so you need to call it explicitly. Try setting a
field in there or some other clear flag that a record has been
processed.
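
To make "call it explicitly" concrete: a chain that is not marked
default="true" is selected per request with the update.chain parameter. Here
is a small sketch that builds such a request URL, assuming the EML files are
posted to the extracting handler at /update/extract. The collection name,
chain name and document id are made up.

```python
from urllib.parse import urlencode


def extract_url(base_url, collection, chain, doc_id):
    """Build an /update/extract URL that names the processor chain explicitly."""
    params = {
        "update.chain": chain,  # select the non-default chain by name
        "literal.id": doc_id,   # unique key for the extracted document
        "commit": "true",
    }
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"


# Hypothetical names: collection "emails", chain "html-strip".
print(extract_url("http://localhost:8983", "emails", "html-strip", "msg-1"))
```

The alternative is to add default="true" to the chain definition in
solrconfig.xml, so that it applies to update requests that do not name a chain.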

Regards,
Alex.

On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo  wrote:
>
> These texts are likely from the original EML file data, but they are not
> visible in the content when the EML file is opened in Microsoft Outlook.
>
> I have already applied the HTMLStripFieldUpdateProcessorFactory in
> solrconfig.xml, but these texts are still showing up in the index. Below is
> my configuration.
>
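> <updateRequestProcessorChain name="...">
>   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
>     <str name="fieldName">content_tcs</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>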
>
>
> Regards,
> Edwin
>
> On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch 
> wrote:
>
> > Specifically, a custom Update Request Processor chain can be used before
> > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > Regards,
> >  Alex
> >
> > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore wrote:
> >
> > > Hi,
> > >
> > > I think this kind of text manipulation should be done before indexing. If
> > > you have font-size and font-family in your text, you are very likely
> > > indexing HTML with CSS.
> > > If I'm right, you are entering a hell of words that should be
> > > removed from your text.
> > >
> > > On the other hand, if you have to do this at index time, a quick and
> > > dirty solution is to use the pattern-replace filter:
> > >
> > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > >
> > > Ciao,
> > > Vincenzo
> > >
> > > --
> > > mobile: 3498513251
> > > skype: free.dev
> > >
> > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo wrote:
> > > >
> > > > Hi,
> > > >
> > > > I noticed that during the indexing of EML files, words like
> > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > content as well.
> > > >
> > > > I would like to check how we can remove those words during indexing.
> > > >
> > > > I am using Solr 7.5.0
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >


Resolved Authorization Issue

2018-12-31 Thread Terry Steichen
Thanks, Dominique.  This appears to explain a LOT of past confusion.

Terry

On 12/31/18 5:26 AM, Dominique Bejean wrote:
> So in Solr standalone mode, only authentication is fully functional, not
> authorization !


Re: How to archive Solr cloud and delete the data?

2018-12-31 Thread Steve Rowe
Hi Rekha,

Do you know about Solr's Time Routed Aliases feature[1]?

Steve

[1] https://lucene.apache.org/solr/guide/7_6/time-routed-aliases.html
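
For orientation, a time routed alias is created through the Collections API
with CREATEALIAS and a set of router.* parameters. The sketch below only
builds such a URL with one collection per month. The alias name, date field
and configset name are placeholders, and the exact parameter set should be
checked against the reference guide page above for your version.

```python
from urllib.parse import urlencode


def create_time_routed_alias_url(base_url, alias, date_field, config_name):
    """Build a Collections API CREATEALIAS URL for a monthly time routed alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": date_field,
        "router.start": "NOW/MONTH",
        "router.interval": "+1MONTH",
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": "1",
    }
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"


# Hypothetical names for the alias, the date field and the configset.
print(create_time_routed_alias_url(
    "http://localhost:8983", "archive", "created_dt", "myConfig"))
```

The appeal for archiving is that old data then lives in its own monthly
collections, so it can be handled per collection instead of by deleting
individual documents from one large index.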

> On Dec 30, 2018, at 11:48 AM, Rekha  
> wrote:
> 
> Hi Solr Team, I want to archive my Solr data. Is there an API available to
> archive data? I planned to read the data month by month and store it in
> another collection, but this takes a long time, just like adding and
> indexing new data. Also, when I delete the archived data from the main
> collection, the disk size does not change: the data directory stays the
> same size after deletion, and only the deleted-documents count is updated in
> the Admin GUI. When I searched for this, somebody said that, depending on
> the merge policy, deleted documents are only removed from disk once they
> reach 50%. I am not clear on this. How can I delete documents and reclaim
> the space they used? Which is the best way to archive data? Thanks, Rekha K



Re: RuleBasedAuthorizationPlugin configuration

2018-12-31 Thread Dominique Bejean
Hi,

In debugging mode, I discovered that the collection name is extracted from
the request path only in SolrCloud mode, in the init() method of
HttpSolrCall.java:

    if (cores.isZooKeeperAware()) {
      // init collectionList (usually one name but not when there are aliases)
      ...
    }

So in Solr standalone mode, only authentication is fully functional, not
authorization!

Regards.

Dominique





On Sun, 30 Dec 2018 at 13:40, Dominique Bejean wrote:

> Hi,
>
> After reading the log file more carefully, here is my understanding.
>
> The request
>
> http://2:xx@localhost:8983/solr/biblio/select?indent=on&q=*:*&wt=json
>
> report this in log
>
> 2018-12-30 12:24:52.102 INFO  (qtp1731656333-20) [   x:biblio]
> o.a.s.s.HttpSolrCall USER_REQUIRED auth header Basic Mjox context :
> userPrincipal: [[principal: 2]] type: [READ], collections: [], Path:
> [/select] path : /select params :q=*:*&indent=on&wt=json
>
> collections is empty, so it looks like "/select" is not collection
> specific and so it is not possible to define read access by collection.
>
> Can someone confirm ?
>
> Regards
>
> Dominique
>
>
>
>
>
> On Fri, 21 Dec 2018 at 10:46, Dominique Bejean wrote:
>
>> Hi,
>>
>> I am trying to configure the security.json file in order to define the
>> following users and permissions:
>>
>>    - user "admin" with all permissions on all collections
>>    - user "read" with read permissions on all collections
>>    - user "1" with only read permissions on the biblio collection
>>    - user "2" with only read permissions on the personnes collection
>>
>> Here is my security.json file
>>
>> {
>>   "authentication":{
>>     "blockUnknown":true,
>>     "class":"solr.BasicAuthPlugin",
>>     "credentials":{
>>       "admin":"4uwfcjV7bCqOdLF/Qn2wiTyC7zIWN6lyA1Bgp1yqZj0= 7PCh68vhIlZXg1l45kSlvGKowMg1bm/L3eSfgT5dzjs=",
>>       "read":"azUFSo9/plsGkQGhSQuk8YXoir22pALVpP8wFkd7wlk= gft4wNAeuvz7P8bv/Jv6TK94g516/qXe9cFWe/VlhDo=",
>>       "1":"azUFSo9/plsGkQGhSQuk8YXoir22pALVpP8wFkd7wlk= gft4wNAeuvz7P8bv/Jv6TK94g516/qXe9cFWe/VlhDo=",
>>       "2":"azUFSo9/plsGkQGhSQuk8YXoir22pALVpP8wFkd7wlk= gft4wNAeuvz7P8bv/Jv6TK94g516/qXe9cFWe/VlhDo="},
>>     "":{"v":0}},
>>   "authorization":{
>>     "class":"solr.RuleBasedAuthorizationPlugin",
>>     "permissions":[
>>       {
>>         "name":"all",
>>         "role":"admin",
>>         "index":1},
>>       {
>>         "name":"read-biblio",
>>         "path":"/select",
>>         "role":["admin","read","r1"],
>>         "collection":"biblio",
>>         "index":2},
>>       {
>>         "name":"read-personnes",
>>         "path":"/select",
>>         "role":["admin","read","r2"],
>>         "collection":"personnes",
>>         "index":3},
>>       {
>>         "name":"read",
>>         "collection":"*",
>>         "role":["admin","read"],
>>         "index":4}],
>>     "user-role":{
>>       "admin":"admin",
>>       "read":"read",
>>       "1":"r1",
>>       "2":"r2"}
>>   }
>> }
>>
>>
>> I get 403 errors for user 1 on biblio and for user 2 on personnes when
>> using the "/select" requestHandler. However, according to the r1 and r2 roles
>> and the order of the permissions, access should be allowed.
>>
>> I have duplicated the TestRuleBasedAuthorizationPlugin.java class in
>> order to test these exact same permissions and roles. checkRules reports
>> that access is allowed!
>>
>> I don't understand where the problem is. Any ideas?
>>
>> Regards
>>
>> Dominique
>>
>>
>>
>>
>>
>>
>>
>>


Re: Facing issue while transforming and indexing custom JSON

2018-12-31 Thread Shubhangi Shinde
Hi Team,

I am waiting for your feedback. Any update on this issue?

On Fri, Dec 28, 2018 at 12:11 PM Shubhangi Shinde <
shubhangi.shi...@iauro.com> wrote:

> Hi Team,
>
> I am using Apache Solr. I followed the section at
> https://lucene.apache.org/solr/guide/7_5/transforming-and-indexing-custom-json.html#setting-json-defaults
> on transforming and indexing custom JSON and added the code to one of my
> cores in order to upload a multilevel JSON document. It throws the error
> below. I have spent a lot of time trying to solve it, but with no luck. The
> error is:
>
> {
>   "responseHeader":{
>     "status":400,
>     "QTime":93},
>   "error":{
>     "metadata":[
>       "error-class","org.apache.solr.common.SolrException",
>       "root-error-class","org.apache.solr.common.SolrException"],
>     "msg":"ERROR: [doc=5b62d25] unknown field '_src_'",
>     "code":400}}
>
> I added the below code in my solrconfig.xml file. Please check,
>
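> <initParams path="/update/json/docs">
>   <lst name="defaults">
>     <str name="srcField">_src_</str>
>     <str name="mapUniqueKeyOnly">true</str>
>     <str name="df">text</str>
>   </lst>
> </initParams>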
>
>
> Please check the issue on stack overflow on the below link,
> https://stackoverflow.com/questions/53775064/apache-solr-error-unknown-field-src
>
> Please let me know if you want some more information about this. Thanks in
> advance.
>
> --
>
> *Shubhangi Shinde*
> Sr. Software Engineer, iauro Systems Pvt. Ltd.
> 020-64008585 | shubhangi.shi...@iauro.com | www.iauro.com
> 
> 
>   
>


-- 

*Shubhangi Shinde*
Sr. Software Engineer, iauro Systems Pvt. Ltd.
020-64008585 | shubhangi.shi...@iauro.com | www.iauro.com