Re: backups of analyzingInfixSuggesterIndexDir

2016-05-13 Thread Arcadius Ahouansou
Hello Oakley.
I am not familiar with the backup process either.

The analyzingInfixSuggesterIndexDir not being in the backup may not be an
issue.

I would suggest you restore the backup on Solr and see whether it's created
automatically for you.

If not, there are many options, like buildOnStartup/buildOnCommit, or issuing
the suggest.build=true command as a post-restore operation.
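
For example (untested, and assuming the suggester is registered under a /suggest
request handler with a dictionary named "mySuggester" -- adjust the names to match
your config), a rebuild after the restore would look something like:

  curl 'http://localhost:8983/solr/<collection>/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true'

Or, if you prefer the config route, the flags live on the suggester definition in
solrconfig.xml, roughly like this (the field and indexPath values are placeholders):

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="indexPath">analyzingInfixSuggesterIndexDir</str>
      <str name="field">title</str>
      <str name="buildOnStartup">true</str>
      <str name="buildOnCommit">false</str>
    </lst>
  </searchComponent>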

Arcadius.


On 13 May 2016 at 16:39, Erick Erickson  wrote:

> No option that I know of, but I'm not up on the details of backup,
> maybe someone else can chime in?
>
> I kind of doubt it though, the choice of where to put the suggest
> index is totally arbitrary so I'm not sure how backup/restore would
> know where to look.
>
> On Thu, May 12, 2016 at 8:09 AM, Oakley, Craig (NIH/NLM/NCBI) [C]
>  wrote:
> > Backup simply by copying the files? or is there some option by which to
> say "include analyzingInfixSuggesterIndexDir as well"?
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Wednesday, May 11, 2016 11:53 PM
> > To: solr-user 
> > Subject: Re: backups of analyzingInfixSuggesterIndexDir
> >
> > Well, it can always be rebuilt from the backed-up index. That suggester
> > reads the _stored_ fields from the docs to build up the suggester
> > index. With a lot of documents that could take a very long time though.
> >
> > If you desperately need it, AFAIK you'll have to back it up whenever
> > you build it I'm afraid.
> >
> > Best,
> > Erick
> >
> > On Wed, May 11, 2016 at 8:30 AM, Oakley, Craig (NIH/NLM/NCBI) [C]
> >  wrote:
> >> I have a client whose Solr installation creates a
> analyzingInfixSuggesterIndexDir directory besides index and tlog. I notice
> that this analyzingInfixSuggesterIndexDir is not included in backups
> (created by replication?command=backup). Is there a way to include this? Or
> does it not need to be backed-up?
> >>
> >> I haven't needed this yet, but wanted to ask before I find that I might
> need it.
>



-- 
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
M: 07908761999
W: www.menelic.com
---


Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
Also the hashJoin is going to read the entire entity table into memory. If
that's a large index that could be using lots of memory.

25 million docs should be ok to /export from one node, as long as you have
enough memory to load the docValues for the fields for sorting and
exporting.

Breaking down the query into its parts will show where the issue is. Also
adding more heap might give you enough memory.

In my testing the max docs per second I've seen the /export handler push
from a single node is 650,000. In order to get 650,000 docs per second on
one node you have to partition the stream with workers. In my testing it
took 8 workers hitting one node to achieve the 650,000 docs per second.

But the numbers get big as the cluster grows. With 20 shards and 4 replicas
and 32 workers, you could export 52,000,000 docs per second. With 40
shards, 5 replicas and 40 workers you could export 130,000,000 docs per
second.

So with large clusters you could do very large distributed joins with
sub-second performance.
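
As a rough, untested sketch of the syntax (collection and field names follow Ryan's
example; the worker collection and zkHost are placeholders), wrapping the join in
parallel() looks roughly like this -- note the partitionKeys on each search so that
matching tuples land on the same worker:

  parallel(workerCollection,
    innerJoin(
      search(triple, q="*:*", fl="triple_id,subject_id,type_id",
             sort="type_id asc", partitionKeys="type_id", qt="/export"),
      search(triple_type, q="*:*", fl="triple_type_id,triple_type_label",
             sort="triple_type_id asc", partitionKeys="triple_type_id", qt="/export"),
      on="type_id=triple_type_id"),
    workers=8,
    sort="type_id asc",
    zkHost="localhost:9983")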




Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 8:11 PM, Ryan Cutter  wrote:

> Thanks very much for the advice.  Yes, I'm running in a very basic single
> shard environment.  I thought that 25M docs was small enough to not require
> anything special but I will try scaling like you suggest and let you know
> what happens.
>
> Cheers, Ryan
>
> On Fri, May 13, 2016 at 4:53 PM, Joel Bernstein 
> wrote:
>
> > I would try breaking down the second query to see when the problems
> occur.
> >
> > 1) Start with just a single *:* search from one of the collections.
> > 2) Then test the innerJoin. The innerJoin won't take much memory as it's
> a
> > streaming merge join.
> > 3) Then try the full thing.
> >
> > If you're running a large join like this all on one host then you might
> not
> > have enough memory for the docValues and the two joins. In general
> > streaming is designed to scale by adding servers. It scales 3 ways:
> >
> > 1) Adding shards, splits up the index for more pushing power.
> > 2) Adding workers, partitions the streams and splits up the join / merge
> > work.
> > 3) Adding replicas, when you have workers you will add pushing power by
> > adding replicas. This is because workers will fetch partitions of the
> > streams from across the entire cluster. So ALL replicas will be pushing
> at
> > once.
> >
> > So, imagine a setup with 20 shards, 4 replicas, and 20 workers. You can
> > perform massive joins quickly.
> >
> > But for your scenario and available hardware you can experiment with
> > different cluster sizes.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, May 13, 2016 at 7:27 PM, Ryan Cutter 
> wrote:
> >
> > > qt="/export" immediately fixed the query in Question #1.  Sorry for
> > missing
> > > that in the docs!
> > >
> > > The second query (with /export) crashes the server so I was going to
> look
> > > at parallelization if you think that's a good idea.  It also seems
> unwise
> > > to join into 26M docs so maybe I can reconfigure the query to run
> > along
> > > a more happy path :-)  The schema is very RDBMS-centric so maybe that
> > just
> > > won't ever work in this framework.
> > >
> > > Here's the log but it's not very helpful.
> > >
> > >
> > > INFO  - 2016-05-13 23:18:13.214; [c:triple s:shard1 r:core_node1
> > > x:triple_shard1_replica1] org.apache.solr.core.SolrCore;
> > > [triple_shard1_replica1]  webapp=/solr path=/export
> > >
> > >
> >
> params={q=*:*&distrib=false&fl=triple_id,subject_id,type_id&sort=type_id+asc&wt=json&version=2.2}
> > > hits=26305619 status=0 QTime=61
> > >
> > > INFO  - 2016-05-13 23:18:13.747; [c:triple_type s:shard1 r:core_node1
> > > x:triple_type_shard1_replica1] org.apache.solr.core.SolrCore;
> > > [triple_type_shard1_replica1]  webapp=/solr path=/export
> > >
> > >
> >
> params={q=*:*&distrib=false&fl=triple_type_id,triple_type_label&sort=triple_type_id+asc&wt=json&version=2.2}
> > > hits=702 status=0 QTime=2
> > >
> > > INFO  - 2016-05-13 23:18:48.504; [   ]
> > > org.apache.solr.common.cloud.ConnectionManager; Watcher
> > > org.apache.solr.common.cloud.ConnectionManager@6ad0f304
> > > name:ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent
> > > state:Disconnected type:None path:null path:null type:None
> > >
> > > INFO  - 2016-05-13 23:18:48.504; [   ]
> > > org.apache.solr.common.cloud.ConnectionManager; zkClient has
> disconnected
> > >
> > > ERROR - 2016-05-13 23:18:51.316; [c:triple s:shard1 r:core_node1
> > > x:triple_shard1_replica1] org.apache.solr.common.SolrException;
> > null:Early
> > > Client Disconnect
> > >
> > > WARN  - 2016-05-13 23:18:51.431; [   ]
> > > org.apache.zookeeper.ClientCnxn$SendThread; Session 0x154ac66c81e0002
> for
> > > server localhost/0:0:0:0:0:0:0:1:9983, unexpected error, closing socket
> > > connection and attempting reconnect
> > >
> > > java.io.IOException: Connection reset by peer
> > >
> > > at 

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
Thanks very much for the advice.  Yes, I'm running in a very basic single
shard environment.  I thought that 25M docs was small enough to not require
anything special but I will try scaling like you suggest and let you know
what happens.

Cheers, Ryan

On Fri, May 13, 2016 at 4:53 PM, Joel Bernstein  wrote:

> I would try breaking down the second query to see when the problems occur.
>
> 1) Start with just a single *:* search from one of the collections.
> 2) Then test the innerJoin. The innerJoin won't take much memory as it's a
> streaming merge join.
> 3) Then try the full thing.
>
> If you're running a large join like this all on one host then you might not
> have enough memory for the docValues and the two joins. In general
> streaming is designed to scale by adding servers. It scales 3 ways:
>
> 1) Adding shards, splits up the index for more pushing power.
> 2) Adding workers, partitions the streams and splits up the join / merge
> work.
> 3) Adding replicas, when you have workers you will add pushing power by
> adding replicas. This is because workers will fetch partitions of the
> streams from across the entire cluster. So ALL replicas will be pushing at
> once.
>
> So, imagine a setup with 20 shards, 4 replicas, and 20 workers. You can
> perform massive joins quickly.
>
> But for your scenario and available hardware you can experiment with
> different cluster sizes.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 13, 2016 at 7:27 PM, Ryan Cutter  wrote:
>
> > qt="/export" immediately fixed the query in Question #1.  Sorry for
> missing
> > that in the docs!
> >
> > The second query (with /export) crashes the server so I was going to look
> > at parallelization if you think that's a good idea.  It also seems unwise
> > to join into 26M docs so maybe I can reconfigure the query to run
> along
> > a more happy path :-)  The schema is very RDBMS-centric so maybe that
> just
> > won't ever work in this framework.
> >
> > Here's the log but it's not very helpful.
> >
> >
> > INFO  - 2016-05-13 23:18:13.214; [c:triple s:shard1 r:core_node1
> > x:triple_shard1_replica1] org.apache.solr.core.SolrCore;
> > [triple_shard1_replica1]  webapp=/solr path=/export
> >
> >
> params={q=*:*&distrib=false&fl=triple_id,subject_id,type_id&sort=type_id+asc&wt=json&version=2.2}
> > hits=26305619 status=0 QTime=61
> >
> > INFO  - 2016-05-13 23:18:13.747; [c:triple_type s:shard1 r:core_node1
> > x:triple_type_shard1_replica1] org.apache.solr.core.SolrCore;
> > [triple_type_shard1_replica1]  webapp=/solr path=/export
> >
> >
> params={q=*:*&distrib=false&fl=triple_type_id,triple_type_label&sort=triple_type_id+asc&wt=json&version=2.2}
> > hits=702 status=0 QTime=2
> >
> > INFO  - 2016-05-13 23:18:48.504; [   ]
> > org.apache.solr.common.cloud.ConnectionManager; Watcher
> > org.apache.solr.common.cloud.ConnectionManager@6ad0f304
> > name:ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent
> > state:Disconnected type:None path:null path:null type:None
> >
> > INFO  - 2016-05-13 23:18:48.504; [   ]
> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
> >
> > ERROR - 2016-05-13 23:18:51.316; [c:triple s:shard1 r:core_node1
> > x:triple_shard1_replica1] org.apache.solr.common.SolrException;
> null:Early
> > Client Disconnect
> >
> > WARN  - 2016-05-13 23:18:51.431; [   ]
> > org.apache.zookeeper.ClientCnxn$SendThread; Session 0x154ac66c81e0002 for
> > server localhost/0:0:0:0:0:0:0:1:9983, unexpected error, closing socket
> > connection and attempting reconnect
> >
> > java.io.IOException: Connection reset by peer
> >
> > at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >
> > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >
> > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >
> > at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> >
> > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> >
> > at
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> >
> > at
> >
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >
> > at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> >
> > On Fri, May 13, 2016 at 3:09 PM, Joel Bernstein 
> > wrote:
> >
> > > A couple of other things:
> > >
> > > 1) Your innerJoin can parallelized across workers to improve
> performance.
> > > Take a look at the docs on the parallel function for the details.
> > >
> > > 2) It looks like you might be doing graph operations with joins. You
> > might
> > > to take a look at the gatherNodes function coming in 6.1:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein 
> > > wrote:
> > >
> > > > When doing things 

Re: Does anybody crawl to a database and then index from the database to Solr?

2016-05-13 Thread Erick Erickson
Clayton:

I think you've done a pretty thorough investigation, I think you're
spot-on. The only thing I would add is that you _will_ reindex your
entire corpus multiple times. Count on it. Sometime, somewhere,
somebody will say "gee, wouldn't it be nice if we could ". And to support it you'll have to change your Solr
schema... which will almost certainly require you to re-index.

The other thing people have done for deleting documents is to create
triggers in your DB to insert the deleted doc IDs into, say, a
"deleted" table along with a timestamp. Whenever necessary/desirable,
run a cleanup task that finds all the IDs since the last time you ran
your deleting program to remove docs that have been flagged since
then. Obviously you also have to keep a record around of the
timestamp of the last successful run of this program.

Or, frankly, since it takes so little time to rebuild from scratch
people have foregone any of that complexity and simply rebuild the
entire index periodically. You can use "collection aliasing" to do
this in the background and then switch searches atomically, it depends
somewhat on how long you can wait until you need to see (well, _not_
see) the deleted docs.
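
For example (collection and alias names made up here), after rebuilding into a new
collection you can repoint the alias with the Collections API:

  http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_20160514

Queries that hit the "products" alias then switch to the new collection atomically,
and the old collection can be dropped whenever convenient.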

But this is all refinements, I think you're going down the right path.

And when you say "connector", are you talking DIH or an external (say
SolrJ) program?

Best,
Erick

On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff
 wrote:
> I've been working on a less-complex thing along the same lines - taking all
> the data from our corporate database and pumping it into Kafka for
> long-term storage -- and the ability to "play back" all the Kafka messages
> any time we need to re-index.
>
> That simpler scenario has worked like a charm.  I don't need to massage the
> data much once it's at rest in Kafka, so that was a straightforward
> solution, although I could have gone with a DB and just stored the solr
> documents with their ID's one per row in a RDBMS...
>
> The rest sounds like good ideas for your situation as Solr isn't the best
> candidate for the kind of manipulation of data you're proposing and a
> database excels at that.  It's more work, but you get a lot more
> flexibility and you de-couple Solr from the data crawling as you say.
>
> It all sounds pretty good to me, but I've only been on the list here a
> short time - so I'll leave it to others to add their comments.
>
> On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J 
> wrote:
>
>> Question:
>> Do any of you have your crawlers write to a database rather than directly
>> to Solr and then use a connector to index to Solr from the database?  If
>> so, have you encountered any issues with this approach?  If not, why not?
>>
>> I have searched forums and the Solr/Lucene email archives (including
>> browsing of http://www.apache.org/foundation/public-archives.html) but
>> have not found any discussions of this idea.  I am certain that I am not
>> the first person to think of it.  I suspect that I have just not figured
>> out the proper queries to find what I am looking for.  Please forgive me if
>> this idea has been discussed before and I just couldn't find the
>> discussions.
>>
>> Background:
>> I am new to Solr and have been asked to make improvements to our Solr
>> configurations and crawlers.  I have read that the Solr index should not be
>> considered a source of record data.  It is in essence a highly optimized
>> index to be used for generating search results rather than a retainer for
>> record copies of data.  The better approach is to rely on corporate data
>> sources for record data and retain the ability to completely blow away a
>> Solr index and repopulate it as needed for changing search requirements.
>> This made me think that perhaps it would be a good idea for us to create a
>> database of crawled data for our Solr index.  The idea is that the crawlers
>> would write their findings to a corporate supported database of our own
>> design for our own purposes and then we would populate our Solr index from
>> this database using a connector that writes from the database to the Solr
>> index.
>> The only disadvantage that I can think of for this approach is that we
>> will need to write a simple interface to the database that allows our admin
>> personnel to "Delete" a record from the Solr index.  Of course, it won't be
>> deleted from the database but simply flagged as not to be indexed to Solr.
>> It will then send a delete command to Solr for any successfully "deleted"
>> records from the database.  I suspect this admin interface will grow over
>> time but we really only need to be able to delete records from the database
>> for now.  All of the rest of our admin work is query related which can
>> still be done through the Solr Console.
>> I can think of the following advantages:
>>
>>   *   We have a corporate sponsored and backed up repository for our
>> crawled data which would buffer us from 

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
I would try breaking down the second query to see when the problems occur.

1) Start with just a single *:* search from one of the collections.
2) Then test the innerJoin. The innerJoin won't take much memory as it's a
streaming merge join.
3) Then try the full thing.

If you're running a large join like this all on one host then you might not
have enough memory for the docValues and the two joins. In general
streaming is designed to scale by adding servers. It scales 3 ways:

1) Adding shards, splits up the index for more pushing power.
2) Adding workers, partitions the streams and splits up the join / merge
work.
3) Adding replicas, when you have workers you will add pushing power by
adding replicas. This is because workers will fetch partitions of the
streams from across the entire cluster. So ALL replicas will be pushing at
once.

So, imagine a setup with 20 shards, 4 replicas, and 20 workers. You can
perform massive joins quickly.

But for your scenario and available hardware you can experiment with
different cluster sizes.



Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 7:27 PM, Ryan Cutter  wrote:

> qt="/export" immediately fixed the query in Question #1.  Sorry for missing
> that in the docs!
>
> The second query (with /export) crashes the server so I was going to look
> at parallelization if you think that's a good idea.  It also seems unwise
> to join into 26M docs so maybe I can reconfigure the query to run along
> a more happy path :-)  The schema is very RDBMS-centric so maybe that just
> won't ever work in this framework.
>
> Here's the log but it's not very helpful.
>
>
> INFO  - 2016-05-13 23:18:13.214; [c:triple s:shard1 r:core_node1
> x:triple_shard1_replica1] org.apache.solr.core.SolrCore;
> [triple_shard1_replica1]  webapp=/solr path=/export
>
> params={q=*:*&distrib=false&fl=triple_id,subject_id,type_id&sort=type_id+asc&wt=json&version=2.2}
> hits=26305619 status=0 QTime=61
>
> INFO  - 2016-05-13 23:18:13.747; [c:triple_type s:shard1 r:core_node1
> x:triple_type_shard1_replica1] org.apache.solr.core.SolrCore;
> [triple_type_shard1_replica1]  webapp=/solr path=/export
>
> params={q=*:*&distrib=false&fl=triple_type_id,triple_type_label&sort=triple_type_id+asc&wt=json&version=2.2}
> hits=702 status=0 QTime=2
>
> INFO  - 2016-05-13 23:18:48.504; [   ]
> org.apache.solr.common.cloud.ConnectionManager; Watcher
> org.apache.solr.common.cloud.ConnectionManager@6ad0f304
> name:ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent
> state:Disconnected type:None path:null path:null type:None
>
> INFO  - 2016-05-13 23:18:48.504; [   ]
> org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>
> ERROR - 2016-05-13 23:18:51.316; [c:triple s:shard1 r:core_node1
> x:triple_shard1_replica1] org.apache.solr.common.SolrException; null:Early
> Client Disconnect
>
> WARN  - 2016-05-13 23:18:51.431; [   ]
> org.apache.zookeeper.ClientCnxn$SendThread; Session 0x154ac66c81e0002 for
> server localhost/0:0:0:0:0:0:0:1:9983, unexpected error, closing socket
> connection and attempting reconnect
>
> java.io.IOException: Connection reset by peer
>
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>
> at
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>
> at
>
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>
> at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>
> On Fri, May 13, 2016 at 3:09 PM, Joel Bernstein 
> wrote:
>
> > A couple of other things:
> >
> > 1) Your innerJoin can parallelized across workers to improve performance.
> > Take a look at the docs on the parallel function for the details.
> >
> > 2) It looks like you might be doing graph operations with joins. You
> might
> > to take a look at the gatherNodes function coming in 6.1:
> >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein 
> > wrote:
> >
> > > When doing things that require all the results (like joins) you need to
> > > specify the /export handler in the search function.
> > >
> > > qt="/export"
> > >
> > > The search function defaults to the /select handler which is designed
> to
> > > return the top N results. The /export handler always returns all
> results
> > > that match the query. Also keep in mind that the /export handler
> requires
> > > that sort fields and fl fields have docValues set.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter 

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
qt="/export" immediately fixed the query in Question #1.  Sorry for missing
that in the docs!

The second query (with /export) crashes the server so I was going to look
at parallelization if you think that's a good idea.  It also seems unwise
to join into 26M docs so maybe I can reconfigure the query to run along
a more happy path :-)  The schema is very RDBMS-centric so maybe that just
won't ever work in this framework.

Here's the log but it's not very helpful.


INFO  - 2016-05-13 23:18:13.214; [c:triple s:shard1 r:core_node1
x:triple_shard1_replica1] org.apache.solr.core.SolrCore;
[triple_shard1_replica1]  webapp=/solr path=/export
params={q=*:*&distrib=false&fl=triple_id,subject_id,type_id&sort=type_id+asc&wt=json&version=2.2}
hits=26305619 status=0 QTime=61

INFO  - 2016-05-13 23:18:13.747; [c:triple_type s:shard1 r:core_node1
x:triple_type_shard1_replica1] org.apache.solr.core.SolrCore;
[triple_type_shard1_replica1]  webapp=/solr path=/export
params={q=*:*&distrib=false&fl=triple_type_id,triple_type_label&sort=triple_type_id+asc&wt=json&version=2.2}
hits=702 status=0 QTime=2

INFO  - 2016-05-13 23:18:48.504; [   ]
org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@6ad0f304
name:ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent
state:Disconnected type:None path:null path:null type:None

INFO  - 2016-05-13 23:18:48.504; [   ]
org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected

ERROR - 2016-05-13 23:18:51.316; [c:triple s:shard1 r:core_node1
x:triple_shard1_replica1] org.apache.solr.common.SolrException; null:Early
Client Disconnect

WARN  - 2016-05-13 23:18:51.431; [   ]
org.apache.zookeeper.ClientCnxn$SendThread; Session 0x154ac66c81e0002 for
server localhost/0:0:0:0:0:0:0:1:9983, unexpected error, closing socket
connection and attempting reconnect

java.io.IOException: Connection reset by peer

at sun.nio.ch.FileDispatcherImpl.read0(Native Method)

at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

at sun.nio.ch.IOUtil.read(IOUtil.java:192)

at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)

at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)

at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)

at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

On Fri, May 13, 2016 at 3:09 PM, Joel Bernstein  wrote:

> A couple of other things:
>
> 1) Your innerJoin can parallelized across workers to improve performance.
> Take a look at the docs on the parallel function for the details.
>
> 2) It looks like you might be doing graph operations with joins. You might
> to take a look at the gatherNodes function coming in 6.1:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein 
> wrote:
>
> > When doing things that require all the results (like joins) you need to
> > specify the /export handler in the search function.
> >
> > qt="/export"
> >
> > The search function defaults to the /select handler which is designed to
> > return the top N results. The /export handler always returns all results
> > that match the query. Also keep in mind that the /export handler requires
> > that sort fields and fl fields have docValues set.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter 
> wrote:
> >
> >> Question #1:
> >>
> >> triple_type collection has a few hundred docs and triple has 25M docs.
> >>
> >> When I search for a particular subject_id in triple which I know has 14
> >> results and do not pass in 'rows' params, it returns 0 results:
> >>
> >> innerJoin(
> >> search(triple, q=subject_id:1656521,
> >> fl="triple_id,subject_id,type_id",
> >> sort="type_id asc"),
> >> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
> >> sort="triple_type_id asc"),
> >> on="type_id=triple_type_id"
> >> )
> >>
> >> When I do the same search with rows=1, it returns 14 results:
> >>
> >> innerJoin(
> >> search(triple, q=subject_id:1656521,
> >> fl="triple_id,subject_id,type_id",
> >> sort="type_id asc", rows=1),
> >> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
> >> sort="triple_type_id asc", rows=1),
> >> on="type_id=triple_type_id"
> >> )
> >>
> >> Am I doing this right?  Is there a magic number to pass into rows which
> >> says "give me all the results which match this query"?
> >>
> >>
> >> Question #2:
> >>
> >> Perhaps related to the first question but I want to run the innerJoin()
> >> without the subject_id - rather have it use the results of another
> query.
> >> But this does not return any results.  I'm saying "search for this
> entity
> >> 

Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
A couple of other things:

1) Your innerJoin can parallelized across workers to improve performance.
Take a look at the docs on the parallel function for the details.

2) It looks like you might be doing graph operations with joins. You might
to take a look at the gatherNodes function coming in 6.1:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 5:57 PM, Joel Bernstein  wrote:

> When doing things that require all the results (like joins) you need to
> specify the /export handler in the search function.
>
> qt="/export"
>
> The search function defaults to the /select handler which is designed to
> return the top N results. The /export handler always returns all results
> that match the query. Also keep in mind that the /export handler requires
> that sort fields and fl fields have docValues set.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter  wrote:
>
>> Question #1:
>>
>> triple_type collection has a few hundred docs and triple has 25M docs.
>>
>> When I search for a particular subject_id in triple which I know has 14
>> results and do not pass in 'rows' params, it returns 0 results:
>>
>> innerJoin(
>> search(triple, q=subject_id:1656521,
>> fl="triple_id,subject_id,type_id",
>> sort="type_id asc"),
>> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc"),
>> on="type_id=triple_type_id"
>> )
>>
>> When I do the same search with rows=1, it returns 14 results:
>>
>> innerJoin(
>> search(triple, q=subject_id:1656521,
>> fl="triple_id,subject_id,type_id",
>> sort="type_id asc", rows=1),
>> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc", rows=1),
>> on="type_id=triple_type_id"
>> )
>>
>> Am I doing this right?  Is there a magic number to pass into rows which
>> says "give me all the results which match this query"?
>>
>>
>> Question #2:
>>
>> Perhaps related to the first question but I want to run the innerJoin()
>> without the subject_id - rather have it use the results of another query.
>> But this does not return any results.  I'm saying "search for this entity
>> based on id then use that result's entity_id as the subject_id to look
>> through the triple/triple_type collections:
>>
>> hashJoin(
>> innerJoin(
>> search(triple, q=*:*, fl="triple_id,subject_id,type_id",
>> sort="type_id asc"),
>> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
>> sort="triple_type_id asc"),
>> on="type_id=triple_type_id"
>> ),
>> hashed=search(entity,
>> q=id:"urn:sid:entity:455dfa1aa27eedad21ac2115797c1580bb3b3b4e",
>> fl="entity_id,entity_label", sort="entity_id asc"),
>> on="subject_id=entity_id"
>> )
>>
>> Am I using doing this hashJoin right?
>>
>> Thanks very much, Ryan
>>
>
>


Re: Streaming Expression joins not returning all results

2016-05-13 Thread Joel Bernstein
When doing things that require all the results (like joins) you need to
specify the /export handler in the search function.

qt="/export"

The search function defaults to the /select handler which is designed to
return the top N results. The /export handler always returns all results
that match the query. Also keep in mind that the /export handler requires
that sort fields and fl fields have docValues set.
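
So, for the first expression in your message, the search would look something like
this (untested, just adding the qt parameter):

  search(triple, q=subject_id:1656521,
         fl="triple_id,subject_id,type_id",
         sort="type_id asc",
         qt="/export")

with triple_id, subject_id and type_id all declared with docValues="true" in the schema.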

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 5:36 PM, Ryan Cutter  wrote:

> Question #1:
>
> triple_type collection has a few hundred docs and triple has 25M docs.
>
> When I search for a particular subject_id in triple which I know has 14
> results and do not pass in 'rows' params, it returns 0 results:
>
> innerJoin(
> search(triple, q=subject_id:1656521, fl="triple_id,subject_id,type_id",
> sort="type_id asc"),
> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
> sort="triple_type_id asc"),
> on="type_id=triple_type_id"
> )
>
> When I do the same search with rows=1, it returns 14 results:
>
> innerJoin(
> search(triple, q=subject_id:1656521, fl="triple_id,subject_id,type_id",
> sort="type_id asc", rows=1),
> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
> sort="triple_type_id asc", rows=1),
> on="type_id=triple_type_id"
> )
>
> Am I doing this right?  Is there a magic number to pass into rows which
> says "give me all the results which match this query"?
>
>
> Question #2:
>
> Perhaps related to the first question but I want to run the innerJoin()
> without the subject_id - rather have it use the results of another query.
> But this does not return any results.  I'm saying "search for this entity
> based on id then use that result's entity_id as the subject_id to look
> through the triple/triple_type collections:
>
> hashJoin(
> innerJoin(
> search(triple, q=*:*, fl="triple_id,subject_id,type_id",
> sort="type_id asc"),
> search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
> sort="triple_type_id asc"),
> on="type_id=triple_type_id"
> ),
> hashed=search(entity,
> q=id:"urn:sid:entity:455dfa1aa27eedad21ac2115797c1580bb3b3b4e",
> fl="entity_id,entity_label", sort="entity_id asc"),
> on="subject_id=entity_id"
> )
>
> Am I using doing this hashJoin right?
>
> Thanks very much, Ryan
>


[Solr 6] Migration from Solr 4.10.2

2016-05-13 Thread Alessandro Benedetti
I'm planning a migration from 4.10.2 to 6.0.
Because we generate the index from scratch on a daily basis, we don't need to
migrate the index itself, only the server instances.
With my team we have been running some experiments on dev machines,
basically comparing Solr 4.10.2 and Solr 6.0 to check for any functional or
performance regression in our use cases.

After setting up both installations on the same machine (switching each
version on and off for comparisons and experiments), we are seeing a
degradation in performance with Solr 6.

Basically, from a query-time and throughput perspective, Solr 6 is not
performing as well as Solr 4.10.2.
We still need to start a proper investigation, but this seems odd to me.
We will proceed with a full analysis and a deep study of our queries
(which are mainly fq, faceting and grouping).

Any suggestions in particular on where to start? Has anyone been through a
similar migration and had a similar experience?
I will also search the mailing list archives for similar cases.

Cheers

-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Streaming Expression joins not returning all results

2016-05-13 Thread Ryan Cutter
Question #1:

triple_type collection has a few hundred docs and triple has 25M docs.

When I search for a particular subject_id in triple which I know has 14
results and do not pass in 'rows' params, it returns 0 results:

innerJoin(
search(triple, q=subject_id:1656521, fl="triple_id,subject_id,type_id",
sort="type_id asc"),
search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
sort="triple_type_id asc"),
on="type_id=triple_type_id"
)

When I do the same search with rows=1, it returns 14 results:

innerJoin(
search(triple, q=subject_id:1656521, fl="triple_id,subject_id,type_id",
sort="type_id asc", rows=1),
search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
sort="triple_type_id asc", rows=1),
on="type_id=triple_type_id"
)

Am I doing this right?  Is there a magic number to pass into rows which
says "give me all the results which match this query"?


Question #2:

Perhaps related to the first question but I want to run the innerJoin()
without the subject_id - rather have it use the results of another query.
But this does not return any results.  I'm saying "search for this entity
based on id then use that result's entity_id as the subject_id to look
through the triple/triple_type collections:

hashJoin(
innerJoin(
search(triple, q=*:*, fl="triple_id,subject_id,type_id",
sort="type_id asc"),
search(triple_type, q=*:*, fl="triple_type_id,triple_type_label",
sort="triple_type_id asc"),
on="type_id=triple_type_id"
),
hashed=search(entity,
q=id:"urn:sid:entity:455dfa1aa27eedad21ac2115797c1580bb3b3b4e",
fl="entity_id,entity_label", sort="entity_id asc"),
on="subject_id=entity_id"
)

Am I using doing this hashJoin right?

Thanks very much, Ryan


Re: Does anybody crawl to a database and then index from the database to Solr?

2016-05-13 Thread John Bickerstaff
I've been working on a less-complex thing along the same lines - taking all
the data from our corporate database and pumping it into Kafka for
long-term storage -- and the ability to "play back" all the Kafka messages
any time we need to re-index.

That simpler scenario has worked like a charm.  I don't need to massage the
data much once it's at rest in Kafka, so that was a straightforward
solution, although I could have gone with a DB and just stored the Solr
documents with their IDs, one per row, in an RDBMS...

The rest sounds like good ideas for your situation as Solr isn't the best
candidate for the kind of manipulation of data you're proposing and a
database excels at that.  It's more work, but you get a lot more
flexibility and you de-couple Solr from the data crawling as you say.

It all sounds pretty good to me, but I've only been on the list here a
short time - so I'll leave it to others to add their comments.

On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J 
wrote:

> Question:
> Do any of you have your crawlers write to a database rather than directly
> to Solr and then use a connector to index to Solr from the database?  If
> so, have you encountered any issues with this approach?  If not, why not?
>
> I have searched forums and the Solr/Lucene email archives (including
> browsing of http://www.apache.org/foundation/public-archives.html) but
> have not found any discussions of this idea.  I am certain that I am not
> the first person to think of it.  I suspect that I have just not figured
> out the proper queries to find what I am looking for.  Please forgive me if
> this idea has been discussed before and I just couldn't find the
> discussions.
>
> Background:
> I am new to Solr and have been asked to make improvements to our Solr
> configurations and crawlers.  I have read that the Solr index should not be
> considered a source of record data.  It is in essence a highly optimized
> index to be used for generating search results rather than a retainer for
> record copies of data.  The better approach is to rely on corporate data
> sources for record data and retain the ability to completely blow away a
> Solr index and repopulate it as needed for changing search requirements.
> This made me think that perhaps it would be a good idea for us to create a
> database of crawled data for our Solr index.  The idea is that the crawlers
> would write their findings to a corporate supported database of our own
> design for our own purposes and then we would populate our Solr index from
> this database using a connector that writes from the database to the Solr
> index.
> The only disadvantage that I can think of for this approach is that we
> will need to write a simple interface to the database that allows our admin
> personnel to "Delete" a record from the Solr index.  Of course, it won't be
> deleted from the database but simply flagged as not to be indexed to Solr.
> It will then send a delete command to Solr for any successfully "deleted"
> records from the database.  I suspect this admin interface will grow over
> time but we really only need to be able to delete records from the database
> for now.  All of the rest of our admin work is query related which can
> still be done through the Solr Console.
> I can think of the following advantages:
>
>   *   We have a corporate sponsored and backed up repository for our
> crawled data which would buffer us from any inadvertent losses of our Solr
> index.
>   *   We would divorce the time it takes to crawl web pages from the time
> it takes to populate our Solr index with data from the crawlers.  I have
> found that my Solr Connector takes minutes to populate the entire Solr
> index from the current Solr prod to the new Solr instances.  Compare that
> to hours and even days to actually crawl the web pages.
>   *   We use URLs for our unique IDs in our Solr index.  We can resolve
> the problem of retaining the shortest URL when duplicate content is
> detected in Solr simply by sorting the query used to populate Solr from the
> database by id length descending - this will ensure the last URL
> encountered for any duplicate is always the shortest.
>   *   We can easily ensure that certain classes of crawled content are
> always added last (or first if you prefer) whenever the data is indexed to
> Solr - rather than having to rely on the timing of crawlers.
>   *   We could quickly and easily rebuild our Solr index from scratch at
> any time.  This would be very valuable when changes to our Solr
> configurations require re-indexing our data.
>   *   We can assign unique boost values to individual "documents" at index
> time by assigning a boost value for that document in the database and then
> applying that boost at index time.
>   *   We can continuously run a batch program that removes broken links
> against this database with no impact to Solr and then refresh Solr on a
> more frequent basis than we do now because the connector 

Does anybody crawl to a database and then index from the database to Solr?

2016-05-13 Thread Pryor, Clayton J
Question:
Do any of you have your crawlers write to a database rather than directly to 
Solr and then use a connector to index to Solr from the database?  If so, have 
you encountered any issues with this approach?  If not, why not?

I have searched forums and the Solr/Lucene email archives (including browsing 
of http://www.apache.org/foundation/public-archives.html) but have not found 
any discussions of this idea.  I am certain that I am not the first person to 
think of it.  I suspect that I have just not figured out the proper queries to 
find what I am looking for.  Please forgive me if this idea has been discussed 
before and I just couldn't find the discussions.

Background:
I am new to Solr and have been asked to make improvements to our Solr 
configurations and crawlers.  I have read that the Solr index should not be 
considered a source of record data.  It is in essence a highly optimized index 
to be used for generating search results rather than a retainer for record 
copies of data.  The better approach is to rely on corporate data sources for 
record data and retain the ability to completely blow away a Solr index and 
repopulate it as needed for changing search requirements.
This made me think that perhaps it would be a good idea for us to create a 
database of crawled data for our Solr index.  The idea is that the crawlers 
would write their findings to a corporate supported database of our own design 
for our own purposes and then we would populate our Solr index from this 
database using a connector that writes from the database to the Solr index.
The only disadvantage that I can think of for this approach is that we will 
need to write a simple interface to the database that allows our admin 
personnel to "Delete" a record from the Solr index.  Of course, it won't be 
deleted from the database but simply flagged as not to be indexed to Solr.  It 
will then send a delete command to Solr for any successfully "deleted" records 
from the database.  I suspect this admin interface will grow over time but we 
really only need to be able to delete records from the database for now.  All 
of the rest of our admin work is query related which can still be done through 
the Solr Console.
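(For reference, and assuming the URL is the uniqueKey, the delete itself would just be a standard update request along these lines -- the URL here is only a placeholder:

  curl 'http://localhost:8983/solr/<collection>/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><id>http://example.com/some/page</id></delete>'
)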
I can think of the following advantages:

  *   We have a corporate sponsored and backed up repository for our crawled 
data which would buffer us from any inadvertent losses of our Solr index.
  *   We would divorce the time it takes to crawl web pages from the time it 
takes to populate our Solr index with data from the crawlers.  I have found 
that my Solr Connector takes minutes to populate the entire Solr index from the 
current Solr prod to the new Solr instances.  Compare that to hours and even 
days to actually crawl the web pages.
  *   We use URLs for our unique IDs in our Solr index.  We can resolve the 
problem of retaining the shortest URL when duplicate content is detected in 
Solr simply by sorting the query used to populate Solr from the database by id 
length descending - this will ensure the last URL encountered for any duplicate 
is always the shortest.
  *   We can easily ensure that certain classes of crawled content are always 
added last (or first if you prefer) whenever the data is indexed to Solr - 
rather than having to rely on the timing of crawlers.
  *   We could quickly and easily rebuild our Solr index from scratch at any 
time.  This would be very valuable when changes to our Solr configurations 
require re-indexing our data.
  *   We can assign unique boost values to individual "documents" at index time 
by assigning a boost value for that document in the database and then applying 
that boost at index time.
  *   We can continuously run a batch program that removes broken links against 
this database with no impact to Solr and then refresh Solr on a more frequent 
basis than we do now because the connector will take minutes rather than 
hours/days to refresh the content.
  *   We can store additional information for the crawler to populate to Solr 
when available - such as:
 *   actual document last updated dates
 *   boost value for that document in the database
  *   This database could be used for other purposes such as:
 *   Identifying a subset of representative data to use for evaluation of 
configuration changes.
 *   Easy access to "indexed" data for analysis work done by those not 
familiar with Solr.
Thanks in advance for your feedback.
Sincerely,
Clay Pryor
R SE Computer Science
9537 - Knowledge Systems
Sandia National Laboratories


RE: Issue with Solr6 CDCR

2016-05-13 Thread Satvinder Singh
Also,

I am using an external zookeeper ensemble with 3 nodes.
Thanks

Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com

From: Satvinder Singh [mailto:satvinder.si...@nc4.com]
Sent: Friday, May 13, 2016 2:38 PM
To: solr-user@lucene.apache.org
Subject: RE: Issue with Solr6 CDCR




Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com

From: Satvinder Singh [mailto:satvinder.si...@nc4.com]
Sent: Friday, May 13, 2016 2:37 PM
To: solr-user@lucene.apache.org
Subject: Issue with Solr6 CDCR

Hi,

I am getting the same errors even after I put in the latest suggested change.

If I put  after the  I get solr instance is 
not configured with cdcr update log. And if I put  before 
 I get  
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error Instantiating Update Handler, solr.DirectUpdateHandler2 failed to 
instantiate org.apache.solr.update.UpdateHandler..

Attached is my config file and solr log. FYI I am running Java 1.8.0_91 and 
solr 6.0.
Thanks

Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com

Disclaimer: This message is intended only for the use of the individual or 
entity to which it is addressed and may contain information which is 
privileged, confidential, proprietary, or exempt from disclosure under 
applicable law. If you are not the intended recipient or the person responsible 
for delivering the message to the intended recipient, you are strictly 
prohibited from disclosing, distributing, copying, or in any way using this 
message. If you have received this communication in error, please notify the 
sender and destroy and delete any copies you may have received.


RE: Issue with Solr6 CDCR

2016-05-13 Thread Satvinder Singh



Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com

From: Satvinder Singh [mailto:satvinder.si...@nc4.com]
Sent: Friday, May 13, 2016 2:37 PM
To: solr-user@lucene.apache.org
Subject: Issue with Solr6 CDCR

Hi,

I am getting same errors even after I put in the latest suggested change.

If I put  after the  I get solr instance is 
not configured with cdcr update log. And if I put  before 
 I get  
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error Instantiating Update Handler, solr.DirectUpdateHandler2 failed to 
instantiate org.apache.solr.update.UpdateHandler..

Attached is my config file and solr log. FYI I am running Java 1.8.0_91 and 
solr 6.0.
Thanks

Satvinder Singh
Security Systems Engineer
satvinder.si...@nc4.com
703.682.6000 x276 direct
703.989.8030 cell
www.NC4.com

Disclaimer: This message is intended only for the use of the individual or 
entity to which it is addressed and may contain information which is 
privileged, confidential, proprietary, or exempt from disclosure under 
applicable law. If you are not the intended recipient or the person responsible 
for delivering the message to the intended recipient, you are strictly 
prohibited from disclosing, distributing, copying, or in any way using this 
message. If you have received this communication in error, please notify the 
sender and destroy and delete any copies you may have received.


Re: More Like This on not new documents

2016-05-13 Thread Nick D
https://wiki.apache.org/solr/MoreLikeThisHandler

Bottom of the page, using content streams. I believe this still works in
newer versions of Solr, although I have not tested it on a recent version.

But if you plan on indexing the document anyway, then just indexing it and
then passing the ID to mlt isn't a bad thing at all.
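
For example (untested, and assuming the MoreLikeThisHandler is registered at /mlt
and that "body" is one of your fields), you can POST the raw text of the
not-yet-indexed document straight to the handler as a content stream:

  curl 'http://localhost:8983/solr/<collection>/mlt?mlt.fl=body&mlt.mintf=1&mlt.mindf=1' \
    -H 'Content-Type: text/plain' \
    --data-binary @newdoc.txt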

Nick

On Fri, May 13, 2016 at 2:23 AM, Vincenzo D'Amore 
wrote:

> Hi all,
>
> anybody know if is there a chance to use the mlt component with a new
> document not existing in the collection?
>
> In other words, if I have a new document, should I always first add it to
> my collection and only then, using the mlt component, have the list of
> similar documents?
>
>
> Best regards,
> Vincenzo
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-05-13 Thread Erick Erickson
I changed the CDCR doc. Oliver, could you take a glance and see if it
is clear now? All I changed was the sample solrconfig sections.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
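
The gist of the sample (trimmed down here, so treat it as a sketch rather than the
full config) is that the CdcrUpdateLog class goes on the updateLog inside the
existing update handler:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    ...
  </updateHandler>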

Thanks,
Erick

On Fri, May 13, 2016 at 6:23 AM, Oliver Rudolph
 wrote:
> Hi,
>
> I had the same problem. The documentation is kind of misleading here. You
> must not add a new <updateLog> element to your config but update the
> existing one. All you need to do is add the
> class="solr.CdcrUpdateLog" attribute to the <updateLog> element inside your
> existing <updateHandler>. Hope this helps!
>
>
> Mit freundlichen Grüßen / Kind regards
>
> Oliver Rudolph
>
> IBM Deutschland Research & Development GmbH
> Vorsitzender des Aufsichtsrats: Martina Koederitz
> Geschäftsführung: Dirk Wittkopp
> Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart,
> HRB 243294
>
>
>
>
>


Re: Error

2016-05-13 Thread Erick Erickson
This is the same problem, you're simply committing
too often, either soft commit or hard commit with
openSearcher=true.

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

You haven't told us how you're committing, I'd
guess either
1> you have your solrconfig settings at some very
low number. Commits should be as long
as you can tolerate.
or
2> you're committing from some client that's
indexing. This is rarely A Good Thing. Either
just let your solrconfig settings handle it or
use the commitWithin form of cloudSolrClient.add().
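
For example, something like this in solrconfig.xml (the intervals are only
illustrative -- make them as long as your latency requirements allow):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>

or, from SolrJ, something like client.add(doc, 300000), which uses commitWithin
instead of issuing explicit commits.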

Best,
Erick

On Wed, May 11, 2016 at 10:09 PM, Midas A  wrote:
> thanks for replying .
>
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> one more warning is coming please suggest for this also.
>
> On Wed, May 11, 2016 at 7:53 PM, Ahmet Arslan 
> wrote:
>
>> Hi Midas,
>>
>> It looks like you are committing too frequently, cache warming cannot
>> catchup.
>> Either lower your commit rate, or disable cache auto warm
>> (autowarmCount=0).
>> You can also remove queries registered at newSearcher event if you have
>> defined some.
>>
>> Ahmet
>>
>>
>>
>> On Wednesday, May 11, 2016 2:51 PM, Midas A  wrote:
>> Hi i am getting following error
>>
>> org.apache.solr.common.SolrException: Error opening new searcher.
>> exceeded limit of maxWarmingSearchers=2, try again later.
>>
>>
>>
>> what should i do to remove it .
>>


Re: Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread John Bickerstaff
I should clarify:

http://XXX.XXX.XX.XX:8983/solr/yourCoreName/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category

"yourCoreName" will get built in for you if you use the Solr Admin UI for
queries --

On Fri, May 13, 2016 at 9:36 AM, John Bickerstaff 
wrote:

> In case it's helpful for a quick and dirty peek at your facets, the
> following URL (in a browser or Curl) will get you basic facets for a field
> named "category" -- assuming you change the IP address / hostname to match
> yours.
>
> http://XXX.XXX.XX.XX:8983/solr/statdx_shard1_replica3/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category
>
> You can also do this in the Admin UI by checking the "facet" box, and
> entering the field name in the facet.field that pops up.  You can leave the
> query field at the default *:*
>
> You need to make sure that you put a "0" in the rows field as well (right
> under "sort") in order to just get back the facet counts.
>
> On Fri, May 13, 2016 at 7:52 AM, Joel Bernstein 
> wrote:
>
>> You may also want to try out the SQL interface in Solr 6.0 which supports
>> SELECT DISTINCT queries.
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface#ParallelSQLInterface-SELECTDISTINCTQueries
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Fri, May 13, 2016 at 9:47 AM, GW  wrote:
>>
>> > Thank you Shawn,
>> >
>> > I will toy with these over the weekend. Solr/Hadoop/Hbase has been a
>> nasty
>> > learning curve for me,
>> > It probably would have been a lot easier if I didn't have 30
>> years of
>> > RDBMS stuck in my head.
>> >
>> > Again,
>> >
>> > Many thanks for your response.
>> >
>> >
>> > On 13 May 2016 at 08:57, Shawn Heisey  wrote:
>> >
>> > > On 5/13/2016 6:48 AM, GW wrote:
>> > > > Let's say I have 10,000 documents and there is a field named
>> "category"
>> > > and
>> > > > lets say there are 200 categories but I do not know what they are.
>> > > >
>> > > > My question: Is there a query/filter that can pull a list of
>> distinct
>> > > > categories?
>> > >
>> > > Sounds like a job for faceting or grouping.  Which one of them to use
>> > > will depend on exactly what you're trying to obtain in your results.
>> > >
>> > > https://cwiki.apache.org/confluence/display/solr/Faceting
>> > > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>> > >
>> > > Thanks,
>> > > Shawn
>> > >
>> > >
>> >
>>
>
>


Re: backups of analyzingInfixSuggesterIndexDir

2016-05-13 Thread Erick Erickson
No option that I know of, but I'm not up on the details of backup,
maybe someone else can chime in?

I kind of doubt it though, the choice of where to put the suggest
index is totally arbitrary so I'm not sure how backup/restore would
know where to look.

On Thu, May 12, 2016 at 8:09 AM, Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
> Backup simply by copying the files? or is there some option by which to say 
> "include analyzingInfixSuggesterIndexDir as well"?
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, May 11, 2016 11:53 PM
> To: solr-user 
> Subject: Re: backups of analyzingInfixSuggesterIndexDir
>
> Well, it can always be rebuilt from the backed-up index. That suggester
> reads the _stored_ fields from the docs to build up the suggester
> index. With a lot of documents that could take a very long time though.
>
> If you desperately need it, AFAIK you'll have to back it up whenever
> you build it I'm afraid.
>
> Best,
> Erick
>
> On Wed, May 11, 2016 at 8:30 AM, Oakley, Craig (NIH/NLM/NCBI) [C]
>  wrote:
>> I have a client whose Solr installation creates a 
>> analyzingInfixSuggesterIndexDir directory besides index and tlog. I notice 
>> that this analyzingInfixSuggesterIndexDir is not included in backups 
>> (created by replication?command=backup). Is there a way to include this? Or 
>> does it not need to be backed-up?
>>
>> I haven't needed this yet, but wanted to ask before I find that I might need 
>> it.


Re: Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread John Bickerstaff
In case it's helpful for a quick and dirty peek at your facets, the
following URL (in a browser or Curl) will get you basic facets for a field
named "category" -- assuming you change the IP address / hostname to match
yours.

http://XXX.XXX.XX.XX:8983/solr/statdx_shard1_replica3/select?
q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category

You can also do this in the Admin UI by checking the "facet" box, and
entering the field name in the facet.field that pops up.  You can leave the
query field at the default *:*

You need to make sure that you put a "0" in the rows field as well (right
under "sort") in order to just get back the facet counts.

On Fri, May 13, 2016 at 7:52 AM, Joel Bernstein  wrote:

> You may also want to try out the SQL interface in Solr 6.0 which supports
> SELECT DISTINCT queries.
>
>
> https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface#ParallelSQLInterface-SELECTDISTINCTQueries
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, May 13, 2016 at 9:47 AM, GW  wrote:
>
> > Thank you Shawn,
> >
> > I will toy with these over the weekend. Solr/Hadoop/Hbase has been a
> nasty
> > learning curve for me,
> > It would probably would have been a lot easier if I didn't have 30 years
> of
> > RDBMS stuck in my head.
> >
> > Again,
> >
> > Many thanks for your response.
> >
> >
> > On 13 May 2016 at 08:57, Shawn Heisey  wrote:
> >
> > > On 5/13/2016 6:48 AM, GW wrote:
> > > > Let's say I have 10,000 documents and there is a field named
> "category"
> > > and
> > > > lets say there are 200 categories but I do not know what they are.
> > > >
> > > > My question: Is there a query/filter that can pull a list of distinct
> > > > categories?
> > >
> > > Sounds like a job for faceting or grouping.  Which one of them to use
> > > will depend on exactly what you're trying to obtain in your results.
> > >
> > > https://cwiki.apache.org/confluence/display/solr/Faceting
> > > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>


RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
>...and I've just blogged about some of the issues one can run into with this 
>sort of project, hope this is useful!
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

+1 completely non-trivial task to roll your own.

I'd add that incorporating multiterm analysis (analysis/normalization of 
wildcard, fuzzy, prefix, regex, etc) is a fundamental requirement too often 
overlooked.  If you don't do this correctly, you'll get results, but not all 
that you should be getting -- you won't know what you can't find. :)  
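
For anyone not familiar with it: in schema terms this is the analyzer
type="multiterm" section of a field type. A rough sketch (the filters are just
illustrative choices):

  <fieldType name="text_general" class="solr.TextField">
    <analyzer type="index"> ... </analyzer>
    <analyzer type="query"> ... </analyzer>
    <analyzer type="multiterm">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

A hand-rolled query parser then has to remember to actually run that analysis
over wildcard/fuzzy/prefix/regex terms instead of passing the raw user text
straight through to the index.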

It would be great if Uwe could add a check for improperly ignoring 
normalization of multiterms to his forbiddenapis. :)


RE: http request to MiniSolrCloudCluster

2016-05-13 Thread Rohana Rajapakse
I am only setting up a MiniSolrCloudCluster with 2 servers like this:

JettyConfig  jettyConfig = 
JettyConfig.builder().waitForLoadingCoresToFinish(null).setContext("/solr").build();
MiniSolrCloudCluster  miniCluster = new MiniSolrCloudCluster(2, 
Paths.get(baseDir), jettyConfig);

I can see the "zookeeper", "node1",  "ndoe2" folders being created  (with 
content in them) in my $baseDir. I have not added any data to Solr index yet.

I don't know what "overseer" is and how to check status of it. My only concern 
is if things are not cleared in zookeeper. Is there any way to check zookeeper 
DB?

As I mentioned before, the cluster works fine when I access it via SolrClient 
(solrj). The issue is when making http requests.

Can someone please test making an http request to a MiniSolrCloudCluster 
(created outside of Solr) and let me know if it works fine.
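
For reference, the per-node base URLs can be read back from the cluster object;
a rough sketch with the Solr 6.x test framework (variable names as in my code
above):

  import org.apache.solr.client.solrj.embedded.JettySolrRunner;

  // miniCluster is the MiniSolrCloudCluster created above
  for (JettySolrRunner jetty : miniCluster.getJettySolrRunners()) {
      // prints something like http://127.0.0.1:54321/solr -- that is the
      // address an external curl request needs to hit
      System.out.println("Solr node: " + jetty.getBaseUrl());
  }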

The log messages during starting up the mini cloud include the following error 
messages:

...
15:34:11,750 INFO  ~ Watcher 
org.apache.solr.common.cloud.ConnectionManager@6e374fbf 
name:ZooKeeperConnection Watcher:127.0.0.1:15570/solr got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
15:34:11,750 INFO  ~ Client is connected to ZooKeeper
15:34:11,752 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0x1 zxid:0x10 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,774 INFO  ~ makePath: /overseer/queue
15:34:11,774 INFO  ~ makePath: /overseer/queue
15:34:11,776 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0x5 zxid:0x12 txntype:-1 
reqpath:n/a Error Path:/solr/overseer/queue Error:KeeperErrorCode = NodeExists 
for /solr/overseer/queue
15:34:11,799 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0x6 zxid:0x13 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,799 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0x7 zxid:0x14 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,823 INFO  ~ makePath: /overseer/collection-queue-work
15:34:11,824 INFO  ~ makePath: /overseer/collection-queue-work
15:34:11,825 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0xb zxid:0x16 txntype:-1 
reqpath:n/a Error Path:/solr/overseer/collection-queue-work 
Error:KeeperErrorCode = NodeExists for /solr/overseer/collection-queue-work
15:34:11,847 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0xc zxid:0x17 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,847 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0xc zxid:0x18 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,860 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0xd zxid:0x19 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,860 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0xd zxid:0x1a txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,885 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0xf zxid:0x1b txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,886 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0xf zxid:0x1c txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,907 INFO  ~ makePath: /overseer/collection-map-running
15:34:11,908 INFO  ~ makePath: /overseer/collection-map-running
15:34:11,920 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0x13 zxid:0x1e txntype:-1 
reqpath:n/a Error Path:/solr/overseer/collection-map-running 
Error:KeeperErrorCode = NodeExists for /solr/overseer/collection-map-running
15:34:11,921 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0005 type:create cxid:0x15 zxid:0x1f txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer
15:34:11,932 INFO  ~ Got user-level KeeperException when processing 
sessionid:0x154aa89febb0006 type:create cxid:0x14 zxid:0x20 txntype:-1 
reqpath:n/a Error Path:/solr/overseer Error:KeeperErrorCode = NodeExists for 
/solr/overseer

Re: URL parameters combined with text param

2016-05-13 Thread Ahmet Arslan
Hi,

In the first debug query response, the special syntax words are themselves being 
parsed as plain query terms, so it is not working.
Not sure the edismax query parser recognizes the _query_ field, but the lucene 
query parser does.
Try switching to the lucene query parser.

Also, if you can divide your query words into q and fq, the following will work:

q=hospital&defType=lucene&fq={!lucene q.op=AND v=$a}&a=Leapfrog


Ahmet
On Friday, May 13, 2016 9:01 AM, Bastien Latard - MDPI AG 
 wrote:



Thanks both!

I already tried "=true", but it doesn't tell me that much...Or at 
least, I don't see any problem...
Below are the responses...

1. /select?q=hospital AND_query_:"{!q.op=AND 
v=$a}"=abstract,title=hospital Leapfrog=true



responseHeader:
  status: 0
  QTime: 280
  params:
    q: hospital AND_query_:"{!q.op=AND v=$a}"
    a: hospital Leapfrog
    debugQuery: true
    fl: abstract,title

debug:
  rawquerystring: hospital AND_query_:"{!q.op=AND v=$a}"
  querystring: hospital AND_query_:"{!q.op=AND v=$a}"
  parsedquery: (+(DisjunctionMaxQuery((abstract:hospit | title:hospit |
    authors:hospital | doi:hospital)) DisjunctionMaxQuery(((Synonym(abstract:and
    abstract:andqueri) abstract:queri) | (Synonym(title:and title:andqueri)
    title:queri) | (Synonym(authors:and authors:andquery) authors:query) |
    doi:and_query_:)) DisjunctionMaxQuery((abstract:"(q qopand) op and (v va) a" |
    title:"(q qopand) op and (v va) a" | authors:"(q qopand) op and (v va) a" |
    doi:"{!q.op=and v=$a}"/no_coord
  parsedquery_toString: +((abstract:hospit | title:hospit | authors:hospital |
    doi:hospital) ((Synonym(abstract:and abstract:andqueri) abstract:queri) |
    (Synonym(title:and title:andqueri) title:queri) | (Synonym(authors:and
    authors:andquery) authors:query) | doi:and_query_:) (abstract:"(q qopand) op
    and (v va) a" | title:"(q qopand) op and (v va) a" | authors:"(q qopand) op
    and (v va) a" | doi:"{!q.op=and v=$a}")
  QParser: ExtendedDismaxQParser
  [...]





2. /select?q=_query_:"{!q.op=AND v='hospital'}"+_query_:"{!q.op=AND 
v=$a}"=hospital Leapfrog=true



responseHeader:
  status: 0
  QTime: 2
  params:
    q: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
    a: hospital Leapfrog
    true
    true

debug:
  rawquerystring: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
  querystring: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
  parsedquery: (+())/no_coord
  parsedquery_toString: +()
  QParser: ExtendedDismaxQParser
  [...]
   



On 12/05/2016 17:06, Erick Erickson wrote:
> Try adding debug=query to your query and look at the parsed results.
> This shows you exactly what Solr sees rather than what you think
> it should.
>
> Best,
> Erick
>
> On Thu, May 12, 2016 at 6:24 AM, Ahmet Arslan  
> wrote:
>> Hi,
>>
>> Well, what happens
>>
>> q=hospital&fq={!lucene q.op=AND v=$a}&a=hospital Leapfrog
>>
>> OR
>>
>> q=+_query_:"{!lucene q.op=AND v='hospital'}" +_query_:"{!lucene q.op=AND 
>> v=$a}"=hospital Leapfrog
>>
>>
>> Ahmet
>>
>>
>> On Thursday, May 12, 2016 3:28 PM, Bastien Latard - MDPI AG 
>>  wrote:
>> Hi Ahmet,
>>
>> Thanks for your answer, but this doesn't work on my local index.
>> q1 returns 2 results.
>>
>> http://localhost:8983/solr/my_core/select?q=hospital AND
>> _query_:"{!q.op=AND%20v=$a}"=abstract,title=hospital Leapfrog
>> ==> returns 254 results (the same as
>> http://localhost:8983/solr/my_core/select?q=hospital )
>>
>> Kind regards,
>> Bastien
>>
>> On 11/05/2016 16:06, Ahmet Arslan wrote:
>>> Hi Bastien,
>>>
>>> Please use magic _query_ field, q=hospital AND _query_:"{!q.op=AND v=$a}"
>>>
>>> ahmet
>>>
>>>
>>> On Wednesday, May 11, 2016 2:35 PM, Latard - MDPI AG 
>>>  wrote:
>>> Hi Everybody,
>>>
>>> Is there a way to pass only some of the data by reference and some
>>> others in the q param?
>>>
>>> e.g.:
>>>
>>> q1.   http://localhost:8983/solr/my_core/select?q={!q.op=OR
>>> v=$a}&fl=abstract,title&a=hospital Leapfrog&debugQuery=true
>>>
>>> q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND
>>> Leapfrog&fl=abstract,title
>>>
>>> q2.  http://localhost:8983/solr/my_core/select?q=hospital AND
>>> ({!q.op=AND v=$a})&fl=abstract,title&a=hospital Leapfrog
>>>
>>> q1 & q1a  are returning the same results, but q2 is somehow not
>>> analyzing the $a parameter properly...
>>>
>>> Am I missing anything?
>>>
>>> Kind regards,
>>> Bastien Latard
>>> Web engineer
>>
>> Kind regards,
>> Bastien Latard
>> Web engineer
>> --
>> MDPI AG
>> Postfach, CH-4005 Basel, Switzerland
>> Office: Klybeckstrasse 64, CH-4057
>> Tel. +41 61 683 77 35
>> Fax: +41 61 302 89 18
>> E-mail:
>> lat...@mdpi.com
>> http://www.mdpi.com/


Kind regards,
Bastien Latard
Web engineer
-- 
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/


Re: Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread Joel Bernstein
You may also want to try out the SQL interface in Solr 6.0 which supports
SELECT DISTINCT queries.

https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface#ParallelSQLInterface-SELECTDISTINCTQueries
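
For a quick test over HTTP, something along these lines should work (the
collection name is just an example; the /sql handler needs SolrCloud mode):

  curl --data-urlencode 'stmt=SELECT DISTINCT category FROM yourCollection LIMIT 200' \
       http://localhost:8983/solr/yourCollection/sql?aggregationMode=facet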

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 13, 2016 at 9:47 AM, GW  wrote:

> Thank you Shawn,
>
> I will toy with these over the weekend. Solr/Hadoop/Hbase has been a nasty
> learning curve for me,
> It would probably would have been a lot easier if I didn't have 30 years of
> RDBMS stuck in my head.
>
> Again,
>
> Many thanks for your response.
>
>
> On 13 May 2016 at 08:57, Shawn Heisey  wrote:
>
> > On 5/13/2016 6:48 AM, GW wrote:
> > > Let's say I have 10,000 documents and there is a field named "category"
> > and
> > > lets say there are 200 categories but I do not know what they are.
> > >
> > > My question: Is there a query/filter that can pull a list of distinct
> > > categories?
> >
> > Sounds like a job for faceting or grouping.  Which one of them to use
> > will depend on exactly what you're trying to obtain in your results.
> >
> > https://cwiki.apache.org/confluence/display/solr/Faceting
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread GW
Thank you Shawn,

I will toy with these over the weekend. Solr/Hadoop/Hbase has been a nasty
learning curve for me,
It would probably have been a lot easier if I didn't have 30 years of
RDBMS stuck in my head.

Again,

Many thanks for your response.


On 13 May 2016 at 08:57, Shawn Heisey  wrote:

> On 5/13/2016 6:48 AM, GW wrote:
> > Let's say I have 10,000 documents and there is a field named "category"
> and
> > lets say there are 200 categories but I do not know what they are.
> >
> > My question: Is there a query/filter that can pull a list of distinct
> > categories?
>
> Sounds like a job for faceting or grouping.  Which one of them to use
> will depend on exactly what you're trying to obtain in your results.
>
> https://cwiki.apache.org/confluence/display/solr/Faceting
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>
> Thanks,
> Shawn
>
>


Re: Need Help with Solr 6.0 Cross Data Center Replication

2016-05-13 Thread Oliver Rudolph
Hi,

I had the same problem. The documentation is kind of misleading here. You 
must not add a new <updateLog> element to your config, but update the 
existing one. All you need to do is add class="solr.CdcrUpdateLog" to the 
<updateLog> element inside your existing <updateHandler>. Hope this helps!
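
For reference, a rough sketch of that fragment in solrconfig.xml (values are
illustrative):

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>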


Mit freundlichen Grüßen / Kind regards

Oliver Rudolph

IBM Deutschland Research & Development GmbH 
Vorsitzender des Aufsichtsrats: Martina Koederitz 
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, 
HRB 243294







Re: dtSearch parser & Introduction

2016-05-13 Thread Charlie Hull

On 13/05/2016 10:41, Charlie Hull wrote:

On 12/05/2016 23:50, Brandon Miller wrote:

Hello, all!  I'm a BloombergBNA employee and need to obtain/write a
dtSearch parser for solr (and probably a bunch of other things a little
later).
I've looked at the available parsers and thought that the surround parser
may do the trick, but it apparently doesn't like nested N or W
subqueries.
I looked at XmlQueryParser and I'm most impressed with it from a
functionality perspective.  I liked the SpanQueries, but I either don't
understand SpanNot or it has a bug for the exclude.
At the end of the day, we will need to continue to support dtSearch
syntax.  I may as well just bite the bullet and write the dtSearch parser
and include it as a patch for Solr.


Hi Brandon,

We have a version of a dtSearch/Lucene query parser written a few years
ago:
http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/


...and I've just blogged about some of the issues one can run into with 
this sort of project, hope this is useful!

http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

Cheers

Charlie



It would need some work to bring it up to date with the latest version
of Solr (which is why we're not offering it for download any more), but
it would save you a lot of time. We've also built parsers for Verity's
query language and some others - just so you're warned, writing parsers
isn't an easy task for a beginner, often to support what looks like a
simple query in your old language can involve some quite complex work on
the Lucene side.

Best

Charlie



Here are my immediate issues:
   - I don't know the best path forward on making the parser (I saw
something in the HowToContribute page at the bottom about JFlex)  -  Can
someone please take pity on me and help me get started down this path?  I
probably won't need a lot of help.
   - I'm great at .NET, not so much Java--yet.  I've not yet been able to
build a trunk and "deploy" it (I can build it and run tests, but not run
it--I'm sure I'm just missing an elusive documentation link on how to do
that)
   - I downloaded and got the solr trunk in Eclipse.  I'm not sure the
best
way of adding unit tests for my stuff--do I add it to an existing
subdirectory or create a new package?

I think it'd be great if I could get a bare-bones example of a parser so
that I can modify it--perhaps even keeping it in a separate Java project.

Don't feel like you have to answer all of my questions--an answer to
any of
them would be quite helpful.

Thank you guys and God bless!







--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread Shawn Heisey
On 5/13/2016 6:48 AM, GW wrote:
> Let's say I have 10,000 documents and there is a field named "category" and
> lets say there are 200 categories but I do not know what they are.
>
> My question: Is there a query/filter that can pull a list of distinct
> categories?

Sounds like a job for faceting or grouping.  Which one of them to use
will depend on exactly what you're trying to obtain in your results.

https://cwiki.apache.org/confluence/display/solr/Faceting
https://cwiki.apache.org/confluence/display/solr/Result+Grouping

Thanks,
Shawn



Re: http request to MiniSolrCloudCluster

2016-05-13 Thread Shawn Heisey
On 5/13/2016 2:26 AM, Rohana Rajapakse wrote:
> Hmmm. I now get the following errors when trying to access my Mini cluster 
> over http:
>
> 09:13:19,611 WARN  ~ Exception causing close of session 0x0 due to 
> java.io.IOException: Len error 1347375956
> 09:13:19,611 INFO  ~ Closed socket connection for client /127.0.0.1:23244 (no 
> session established for client)

The length that it is complaining about is a very large number -- 1.3
billion.

The error message excludes the detail I would need to learn whether it
comes from Zookeeper or Solr.  If it's coming from zookeeper, your
zookeeper database contains a node whose size is over a thousand times
larger than what zookeeper supports by default.

The default maximum for znode size is about one megabyte.  If you're
running a recent version of Solr (5.x or later), the most likely culprit
for a large znode is the overseer queue ... but to reach an overseer
queue size of 1.3 billion bytes would probably require an extremely
large cluster with thousands of cores, and that cluster would likely be
experiencing a lot of stability issues.  This is a situation that even a
full Solr install has trouble handling -- I would never try to do it
with a test class like MiniSolrCloudCluster.

I could also be completely wrong, but without detailed logs and more
information on what you're trying to build, it's impossible for me to say.

Thanks,
Shawn



Is there an equivalent to an SQL "select distinct" in Solr

2016-05-13 Thread GW
Let's say I have 10,000 documents and there is a field named "category" and
lets say there are 200 categories but I do not know what they are.

My question: Is there a query/filter that can pull a list of distinct
categories?

Thanks in advance,

GW


Re: Fwd: Solr Cloud 6.0.0 hangs when creating large amount of collections and node fails to recover after restart

2016-05-13 Thread Shawn Heisey
On 5/13/2016 2:19 AM, Horváth Péter Gergely wrote:
> Thank you for your feedback, I much appreciate your inputs. I don't
> have strong requirements regarding structuring the data: do you think
> I could use a single, relatively large collection with some
> discriminator field instead of multiple thousands of separate
> collections?
The answer is probably yes, but since I do not know anything about your
data or how you're going to need to use it, I cannot be confident in
saying that.

If all your collections will be using very similar schemas and configs,
adding a field that you can filter on to produce the same results as a
smaller individual collection can allow you to combine documents into a
larger collection.  You would need to populate that field with a useful
value when indexing.
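
For illustration only (field, value and collection names are made up), the
discriminator idea looks like this at index and query time:

  curl http://localhost:8983/solr/bigcollection/update?commit=true \
       -H 'Content-Type: application/json' \
       -d '[{"id":"doc1","dataset_s":"tenant42","title":"some document"}]'

  http://localhost:8983/solr/bigcollection/select?q=title:document&fq=dataset_s:tenant42

Putting the discriminator in an fq also means the filter gets cached and
reused across queries.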

Thanks,
Shawn



RE: dtSearch parser & Introduction

2016-05-13 Thread Allison, Timothy B.
Depending on your needs, you might want to take a look at my SpanQueryParser 
(LUCENE-5205/SOLR-5410).  It does not offer dtsearch syntax, but if the 
SurroundQueryParser was close enough, this parser may be of use.  If you need 
modifications to it, let me know.  I'm in the process of adding 
SpanPositionRangeQuery syntax.

If you need to roll your own, beware, it is not a trivial task.  The 
SimpleQueryParser might offer the cleanest example to build on top of.
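
Since the original post asked for a bare-bones example: the Solr-side plumbing
for a custom parser is small, the hard part is the query translation itself. A
rough sketch against Solr 6.x (names are made up, this is not a dtSearch
implementation):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class MyQParserPlugin extends QParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          // The real work goes here: translate the incoming syntax (W/N
          // proximity operators etc.) into Lucene Query/SpanQuery objects.
          String field = params.get("df", "text");
          return new TermQuery(new Term(field, qstr));
        }
      };
    }
  }

Register it in solrconfig.xml with something like
  <queryParser name="myparser" class="com.example.MyQParserPlugin"/>
and invoke it as q={!myparser}... in a request.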

Working versions of LUCENE-5205/SpanQueryParser are available on my github 
site.  If you are using Lucene/Solr 5.5, for example, go to this branch:

https://github.com/tballison/lucene-addons/tree/lucene5.5-0.1



-Original Message-


From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Friday, May 13, 2016 5:41 AM
To: solr-user@lucene.apache.org
Subject: Re: dtSearch parser & Introduction

On 12/05/2016 23:50, Brandon Miller wrote:
> Hello, all!  I'm a BloombergBNA employee and need to obtain/write a 
> dtSearch parser for solr (and probably a bunch of other things a 
> little later).
> I've looked at the available parsers and thought that the surround 
> parser may do the trick, but it apparently doesn't like nested N or W 
> subqueries.
> I looked at XmlQueryParser and I'm most impressed with it from a 
> functionality perspective.  I liked the SpanQueries, but I either 
> don't understand SpanNot or it has a bug for the exclude.
> At the end of the day, we will need to continue to support dtSearch 
> syntax.  I may as well just bite the bullet and write the dtSearch 
> parser and include it as a patch for Solr.

Hi Brandon,

We have a version of a dtSearch/Lucene query parser written a few years
ago: 
http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/

It would need some work to bring it up to date with the latest version of Solr 
(which is why we're not offering it for download any more), but it would save 
you a lot of time. We've also built parsers for Verity's query language and 
some others - just so you're warned, writing parsers isn't an easy task for a 
beginner, often to support what looks like a simple query in your old language 
can involve some quite complex work on the Lucene side.

Best

Charlie

>
> Here are my immediate issues:
>- I don't know the best path forward on making the parser (I saw 
> something in the HowToContribute page at the bottom about JFlex)  -  
> Can someone please take pity on me and help me get started down this 
> path?  I probably won't need a lot of help.
>- I'm great at .NET, not so much Java--yet.  I've not yet been able 
> to build a trunk and "deploy" it (I can build it and run tests, but 
> not run it--I'm sure I'm just missing an elusive documentation link on 
> how to do
> that)
>- I downloaded and got the solr trunk in Eclipse.  I'm not sure the 
> best way of adding unit tests for my stuff--do I add it to an existing 
> subdirectory or create a new package?
>
> I think it'd be great if I could get a bare-bones example of a parser 
> so that I can modify it--perhaps even keeping it in a separate Java project.
>
> Don't feel like you have to answer all of my questions--an answer to 
> any of them would be quite helpful.
>
> Thank you guys and God bless!
>


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: dtSearch parser & Introduction

2016-05-13 Thread Charlie Hull

On 12/05/2016 23:50, Brandon Miller wrote:

Hello, all!  I'm a BloombergBNA employee and need to obtain/write a
dtSearch parser for solr (and probably a bunch of other things a little
later).
I've looked at the available parsers and thought that the surround parser
may do the trick, but it apparently doesn't like nested N or W subqueries.
I looked at XmlQueryParser and I'm most impressed with it from a
functionality perspective.  I liked the SpanQueries, but I either don't
understand SpanNot or it has a bug for the exclude.
At the end of the day, we will need to continue to support dtSearch
syntax.  I may as well just bite the bullet and write the dtSearch parser
and include it as a patch for Solr.


Hi Brandon,

We have a version of a dtSearch/Lucene query parser written a few years 
ago: 
http://www.flax.co.uk/blog/2012/04/24/dtsolr-an-open-source-replacement-for-the-dtsearch-closed-source-search-engine/


It would need some work to bring it up to date with the latest version 
of Solr (which is why we're not offering it for download any more), but 
it would save you a lot of time. We've also built parsers for Verity's 
query language and some others - just so you're warned, writing parsers 
isn't an easy task for a beginner, often to support what looks like a 
simple query in your old language can involve some quite complex work on 
the Lucene side.


Best

Charlie



Here are my immediate issues:
   - I don't know the best path forward on making the parser (I saw
something in the HowToContribute page at the bottom about JFlex)  -  Can
someone please take pity on me and help me get started down this path?  I
probably won't need a lot of help.
   - I'm great at .NET, not so much Java--yet.  I've not yet been able to
build a trunk and "deploy" it (I can build it and run tests, but not run
it--I'm sure I'm just missing an elusive documentation link on how to do
that)
   - I downloaded and got the solr trunk in Eclipse.  I'm not sure the best
way of adding unit tests for my stuff--do I add it to an existing
subdirectory or create a new package?

I think it'd be great if I could get a bare-bones example of a parser so
that I can modify it--perhaps even keeping it in a separate Java project.

Don't feel like you have to answer all of my questions--an answer to any of
them would be quite helpful.

Thank you guys and God bless!




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


More Like This on not new documents

2016-05-13 Thread Vincenzo D'Amore
Hi all,

anybody know if there is a chance to use the mlt component with a new
document not existing in the collection?

In other words, if I have a new document, should I always first add it to
my collection and only then, using the mlt component, have the list of
similar documents?


Best regards,
Vincenzo


-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


RE: http request to MiniSolrCloudCluster

2016-05-13 Thread Rohana Rajapakse
Hmmm. I now get the following errors when trying to access my Mini cluster over 
http:

09:13:19,611 WARN  ~ Exception causing close of session 0x0 due to 
java.io.IOException: Len error 1347375956
09:13:19,611 INFO  ~ Closed socket connection for client /127.0.0.1:23244 (no 
session established for client)


I have a working code in which I start up a mini cluster and use it with the 
SolrClient obtained from the code line  miniCluster.getSolrClient(). This works 
fine.
Then I tried stopping the code execution after the mini cluster was started up, 
and then tried the following cURL commands. They all give the same error  given 
above:

curl -v -i -X POST 127.0.0.1:23052/solr/admin/cores?action=status
curl -v -i -X POST http://127.0.0.1:23052/solr/admin/cores?action=status
curl -v -i -X POST localhost:23052/solr/admin/cores?action=status

cURL response is:

* About to connect() to 127.0.0.1 port 23052 (#0)
*   Trying 127.0.0.1...
* Adding handle: conn: 0x457530
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x457530) send_pipe: 1, recv_pipe: 0
* Connected to 127.0.0.1 (127.0.0.1) port 23052 (#0)
> POST /solr/admin/cores?action=status HTTP/1.1
> User-Agent: curl/7.33.0
> Host: 127.0.0.1:23052
> Accept: */*
>
* Empty reply from server
* Connection #0 to host 127.0.0.1 left intact
curl: (52) Empty reply from server


-Original Message-
From: Rohana Rajapakse [mailto:rohana.rajapa...@gossinteractive.com] 
Sent: 12 May 2016 15:31
To: solr-user@lucene.apache.org
Subject: RE: http request to MiniSolrCloudCluster

You are correct Alan. My cluster seems to have not started correctly. Debugging 
now...

Thanks

-Original Message-
From: Alan Woodward [mailto:a...@flax.co.uk] 
Sent: 12 May 2016 13:18
To: solr-user@lucene.apache.org
Subject: Re: http request to MiniSolrCloudCluster

Are you sure that the cluster is running properly?  Probably worth checking its 
logs to make sure Solr has started correctly?

Alan Woodward
www.flax.co.uk


On 12 May 2016, at 12:48, Rohana Rajapakse wrote:

> Wait.
> With correct port, curl says : "curl: (52) Empty reply from server"
> 
> 
> -Original Message-
> From: Alan Woodward [mailto:a...@flax.co.uk] 
> Sent: 12 May 2016 11:35
> To: solr-user@lucene.apache.org
> Subject: Re: http request to MiniSolrCloudCluster
> 
> Hi Rohana,
> 
> What error messages do you get from curl?  MiniSolrCloudCluster just runs 
> jetty, so you ought to be able to talk to it over HTTP.
> 
> Alan Woodward
> www.flax.co.uk
> 
> 
> On 12 May 2016, at 09:36, Rohana Rajapakse wrote:
> 
>> Hi,
>> 
>> Is it possible to make http requests (e.g. from cURL) to an active/running  
>> MiniSolrCloudCluster?
>> One of my existing projects use http requests to an EmbeddedSolrServer. Now 
>> I am migrating to Solr-6/7 and trying to use MiniSolrCloudCluster. I have 
>> got a MiniSolrCloudCluster up and running, but existing requests fails to 
>> talk to my MiniSolrCloudCluster  using the url 
>> http://127.0.0.1:6028/solr/minicluster.
>> Even the ping requests to this MiniSolrCloudCluster fails: 
>> http://127.0.0.1:6028/solr/minicluster/admin/ping?wt=json=true=true
>> 
>> Can someone please shed some light on this please?
>> 
>> Rohana
>> 
>> 
>> Registered Office: 24 Darklake View, Estover, Plymouth, PL6 7TL.
>> Company Registration No: 3553908
>> 
>> This email contains proprietary information, some or all of which may be 
>> legally privileged. It is for the intended recipient only. If an addressing 
>> or transmission error has misdirected this email, please notify the author 
>> by replying to this email. If you are not the intended recipient you may not 
>> use, disclose, distribute, copy, print or rely on this email.
>> 
>> Email transmission cannot be guaranteed to be secure or error free, as 
>> information may be intercepted, corrupted, lost, destroyed, arrive late or 
>> incomplete or contain viruses. This email and any files attached to it have 
>> been checked with virus detection software before transmission. You should 
>> nonetheless carry out your own virus check before opening any attachment. 
>> GOSS Interactive Ltd accepts no liability for any loss or damage that may be 
>> caused by software viruses.
>> 
>> 
> 



Re: Fwd: Solr Cloud 6.0.0 hangs when creating large amount of collections and node fails to recover after restart

2016-05-13 Thread Horváth Péter Gergely
Hi Shawn,

Thank you for your feedback, I much appreciate your inputs. I don't have
strong requirements regarding structuring the data: do you think I could
use a single, relatively large collection with some discriminator field
instead of multiple thousands of separate collections?

Thanks,
Peter


2016-05-12 20:30 GMT+02:00 Shawn Heisey :

> On 5/12/2016 9:08 AM, Horváth Péter Gergely wrote:
> > As part of benchmark, I attempted to create about 2500 collections to
> > see how well that would work for us. Unfortunately, the experiment
> > yielded some disappointing results, after about 2000 being created
> > SolR got hung; REST requests started failing. I found the following in
> > the logs:
>
> Solr will not handle that many collections very well.  You're pushing
> the boundaries of scalability.  See this issue that I created:
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> Are you creating the collections sequentially, or running multiple
> CREATE actions simultaneously?  Sequentially, where you wait for a
> previous CREATE to complete before executing another one, is strongly
> advised.
>
> SolrCloud starts to have serious problems when you create a lot of
> collections.  We are aware of the scalability issues, but they are not
> easy to fix.
>
> Thanks,
> Shawn
>
>


mmseg4j cause error in Solr 6.0.0

2016-05-13 Thread scott.chu

Previously I made a configset with the mmseg4j tokenizer and created a core on Solr 
5.4.1 under Win7. That worked successfully.
Today I repeated the same steps under Solr 6.0.0. When I create the collection, it 
returns the error message:

*
ERROR: Failed to create collection 'cloud_ugna' due to:
{10.18.1.81:7574_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
 Error from server at http://10.18.1.81:7574/solr: Error CREATEing SolrCore
 'cloud_ugna_shard1_replica1': Unable to create core [cloud_ugna_shard1_replica1]
 Caused by: com.chenlb.mmseg4j.solr.MMSegTokenizerFactory,
 10.18.1.81:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
 Error from server at http://10.18.1.81:8983/solr: Error CREATEing SolrCore
 'cloud_ugna_shard1_replica2': Unable to create core [cloud_ugna_shard1_replica2]
 Caused by: com.chenlb.mmseg4j.solr.MMSegTokenizerFactory}
*

The steps are as follows:
1. Download mmseg4j v2.3 from here: https://github.com/chenlb/mmseg4j-solr
2. Copy the two lib jars, mmseg4j-core-1.10.0-tcdic.jar and mmseg4j-solr-2.3.0.jar, 
into server\solr\solr-webapp\webapp\WEB-INF\lib.
3. Copy the mmseg4j-related config part from my Solr 5.4.1 config (where mmseg4j 
runs successfully) into solrconfig.xml on the Solr 6.0.0 side.
4. Run with bin\solr create_collection -c cloud_ugna -d 
d:\solr\myconfigset\cloud_ugna -z localhost:9983 -p 8983

the part added into solrconfig.xml is:

**


mydic
true
false
simple




What could be the cause? Is it a version compatibility problem between Solr 6 
and mmseg4j v2.3?


BTW, in Solr 6.0.0, the example solrconfig.xml has no <schemaFactory> tag, so I 
just added the following line on my own:

since my config on 5.4.1 is using schema.xml.

Could the problem come from this modification?


Re: URL parameters combined with text param

2016-05-13 Thread Bastien Latard - MDPI AG

Thanks both!

I already tried "=true", but it doesn't tell me that much...Or at 
least, I don't see any problem...

Below are the responses...

1. /select?q=hospital AND_query_:"{!q.op=AND 
v=$a}"=abstract,title=hospital Leapfrog=true




responseHeader:
  status: 0
  QTime: 280
  params:
    q: hospital AND_query_:"{!q.op=AND v=$a}"
    a: hospital Leapfrog
    debugQuery: true
    fl: abstract,title

debug:
  rawquerystring: hospital AND_query_:"{!q.op=AND v=$a}"
  querystring: hospital AND_query_:"{!q.op=AND v=$a}"
  parsedquery: (+(DisjunctionMaxQuery((abstract:hospit | title:hospit |
    authors:hospital | doi:hospital)) DisjunctionMaxQuery(((Synonym(abstract:and
    abstract:andqueri) abstract:queri) | (Synonym(title:and title:andqueri)
    title:queri) | (Synonym(authors:and authors:andquery) authors:query) |
    doi:and_query_:)) DisjunctionMaxQuery((abstract:"(q qopand) op and (v va) a" |
    title:"(q qopand) op and (v va) a" | authors:"(q qopand) op and (v va) a" |
    doi:"{!q.op=and v=$a}"/no_coord
  parsedquery_toString: +((abstract:hospit | title:hospit | authors:hospital |
    doi:hospital) ((Synonym(abstract:and abstract:andqueri) abstract:queri) |
    (Synonym(title:and title:andqueri) title:queri) | (Synonym(authors:and
    authors:andquery) authors:query) | doi:and_query_:) (abstract:"(q qopand) op
    and (v va) a" | title:"(q qopand) op and (v va) a" | authors:"(q qopand) op
    and (v va) a" | doi:"{!q.op=and v=$a}")
  QParser: ExtendedDismaxQParser
  [...]





2. /select?q=_query_:"{!q.op=AND v='hospital'}"+_query_:"{!q.op=AND 
v=$a}"=hospital Leapfrog=true




responseHeader:
  status: 0
  QTime: 2
  params:
    q: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
    a: hospital Leapfrog
    true
    true

debug:
  rawquerystring: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
  querystring: _query_:"{!q.op=AND v='hospital'}" _query_:"{!q.op=AND v=$a}"
  parsedquery: (+())/no_coord
  parsedquery_toString: +()
  QParser: ExtendedDismaxQParser
  [...]
  



On 12/05/2016 17:06, Erick Erickson wrote:

Try adding debug=query to your query and look at the parsed results.
This shows you exactly what Solr sees rather than what you think
it should.

Best,
Erick

On Thu, May 12, 2016 at 6:24 AM, Ahmet Arslan  wrote:

Hi,

Well, what happens

q=hospital&fq={!lucene q.op=AND v=$a}&a=hospital Leapfrog

OR

q=+_query_:"{!lucene q.op=AND v='hospital'}" +_query_:"{!lucene q.op=AND 
v=$a}"=hospital Leapfrog


Ahmet


On Thursday, May 12, 2016 3:28 PM, Bastien Latard - MDPI AG 
 wrote:
Hi Ahmet,

Thanks for your answer, but this doesn't work on my local index.
q1 returns 2 results.

http://localhost:8983/solr/my_core/select?q=hospital AND
_query_:"{!q.op=AND%20v=$a}"=abstract,title=hospital Leapfrog
==> returns 254 results (the same as
http://localhost:8983/solr/my_core/select?q=hospital )

Kind regards,
Bastien

On 11/05/2016 16:06, Ahmet Arslan wrote:

Hi Bastien,

Please use magic _query_ field, q=hospital AND _query_:"{!q.op=AND v=$a}"

ahmet


On Wednesday, May 11, 2016 2:35 PM, Latard - MDPI AG  
wrote:
Hi Everybody,

Is there a way to pass only some of the data by reference and some
others in the q param?

e.g.:

q1.   http://localhost:8983/solr/my_core/select?q={!q.op=OR
v=$a}&fl=abstract,title&a=hospital Leapfrog&debugQuery=true

q1a.  http://localhost:8983/solr/my_core/select?q=hospital AND
Leapfrog&fl=abstract,title

q2.  http://localhost:8983/solr/my_core/select?q=hospital AND
({!q.op=AND v=$a})&fl=abstract,title&a=hospital Leapfrog

q1 & q1a  are returning the same results, but q2 is somehow not
analyzing the $a parameter properly...

Am I missing anything?

Kind regards,
Bastien Latard
Web engineer


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/


Kind regards,
Bastien Latard
Web engineer
--
MDPI AG
Postfach, CH-4005 Basel, Switzerland
Office: Klybeckstrasse 64, CH-4057
Tel. +41 61 683 77 35
Fax: +41 61 302 89 18
E-mail:
lat...@mdpi.com
http://www.mdpi.com/