Re: Disable all caches in Solr

2014-07-07 Thread vidit.asthana
Thanks Chris. I understand this. But this test is to determine the *maximum*
latency a query can have and hence I have disabled all caches. 

After disabling all caches in solrconfig, I was able to remove "latency
variation" for a single query in most of the cases. But still *sort* queries
are showing variation in latency when executed multiple times. Is there some
hidden cache for sorting?

When I run the query below for the first time, it shows higher latency, but
when I run it a second time it shows a lower QTime.

http://localhost:7000/solr/collection1/select?q=field1:keyword&rows=20&sort=field2 desc

*If I remove the sorting then I always get fixed QTime*. "field2" is of type
tlong.

Any ideas why this is happening and how to prevent this variation?








Re: Need of hadoop

2014-07-07 Thread Erick Erickson
And that's exactly what it means. The HdfsDirectoryFactory is intended
to use the HDFS file system to store the Solr (well, actually Lucene)
index. It's (by default) triply redundant and vastly reduces your
chances of losing your index due to disk errors. That's what HDFS
does.

If that's not useful to you, then you shouldn't use it.
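
For reference, pointing Solr at HDFS is essentially a directoryFactory
change in solrconfig.xml, roughly of this shape (the namenode host, port,
and path are placeholders):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  </directoryFactory>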

This also is what the MapReduceIndexerTool is written to work with.
You can spread your M/R jobs out over all your cluster, whether or not
you're running Solr on all the nodes that can run M/R jobs. And with
the --go-live option, the final results can be merged into your live
Solr index on whichever nodes are running your Solr instances. Which
are using HdfsDirectoryFactory.

There are anecdotal reports of being able to use the
MapReduceIndexerTool to spread the indexing jobs out, then merge them
with a Solr index on a local file system. However, this is not a
use-case supported by the authors of the MapReduceIndexerTool, and I
suspect (but don't know for sure) that the native file system
implementation doesn't, for instance, make explicit use of the
MMapDirectory wrapper in Lucene. That would be a nice case to support,
but it isn't high on the contributors' priority list.

So the "bottom line" is that if the file redundancy (and associated
goodness of being able to access the data from anywhere) isn't
valuable to you, there's no particular reason to use the Solr index
over HDFS.

Best,
Erick

On Mon, Jul 7, 2014 at 9:49 PM, search engn dev
 wrote:
> It is written  here
> 
>
>
>


Re: Need of hadoop

2014-07-07 Thread search engn dev
It is written  here
  





Re: Exact Match first in the list.

2014-07-07 Thread Shawn Heisey
> Hi, I have a situation where I'm applying the search rules below.
>
> When I do a full-text search on columns for "Product Variant Name", the
> exact match has to come first in the list, and other matches like product
> or variant or name, or any combination, should come next in the results.
>
> Any thoughts on which analyzer, tokenizer, or filter I need to use?


This is more a matter of boosting than analysis.

If you are using edismax, this is particularly easy. Just put large boost
values on the fields in the pf parameter, and you'd likely want to use the
same field list as the qf parameter.

If you are not using edismax and can construct such a query yourself, you
can boost the phrase over the individual terms. Here's a sample query:

"Product Variant Name"^10 OR (Product Variant Name)

This is essentially what edismax will do with a boost on the pf values,
except that it will work with more than one field. The edismax parser is a
wonderful creation.
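
As a rough sketch, such a request could look like this (the field name
product_name is illustrative, not from your schema):

  http://localhost:8983/solr/collection1/select?defType=edismax
    &q=Product Variant Name
    &qf=product_name
    &pf=product_name^100

Because pf re-applies the whole query as a phrase against the listed
fields, documents containing the exact phrase pick up the large boost
and sort to the top.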

Thanks,
Shawn



Re: Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Shawn Heisey
> Thanks Shawn, clean way to do it, indeed. And going your route, one
> could even copy the existing shards into the new collection and then
> delete the data which is getting reindexed on the new nodes. That would
> spare reindexing everything.
>
> But in my case, I add boxes after a noticeable performance degradation
> due to data volume increase. So the old boxes cannot afford reindexing
> data (or deleting, if using the proposed variation) in the new collection
> while serving searches with the old collection. Unless there is a way to
> aggressively bound the RAM consumption of the new collection (disabling
> MMAP?), given that it's not being used for search during the transition?
> That said, even if that was possible, both collections would compete for
> disk I/O.

I don't think you'd want to disable mmap. It could be done, by choosing
another DirectoryFactory object. Adding memory is likely to be the only
sane way forward.

Another possibility would be to bump up the maxShardsPerNode value and
build the new collection (with the proper number of shards) only on the
new machines... Then when they are built, move them to their proper homes
and manually adjust the cluster state in zookeeper. This will still
generate a lot of I/O, but hopefully it will last for less time on the
wall clock, and it will be something you can do when load is low.

After that's done and you've switched to it, you can add replicas with
either the ADDREPLICA collections API or with the core admin API. You
should be on the newest Solr version... Lots of bugs have been found and
fixed.
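
For reference, the ADDREPLICA call is along these lines (the collection,
shard, and node values are placeholders):

  http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test0&shard=shard1&node=10.0.0.5:8983_solr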

One thing I wonder is whether the MIGRATE api can be used on an entire
collection. It says it works by shard key, but I suspect that most users
will not be using that functionality.

Thanks,
Shawn





Exact Match first in the list.

2014-07-07 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
Hi, I have a situation where I'm applying the search rules below.

When I do a full-text search on columns for "Product Variant Name", the
exact match has to come first in the list, and other matches like product
or variant or name, or any combination, should come next in the results.

Any thoughts on which analyzer, tokenizer, or filter I need to use?

Thanks

Ravi


Re: Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Damien Dykman
Thanks Shawn, clean way to do it, indeed. And going your route, one 
could even copy the existing shards into the new collection and then 
delete the data which is getting reindexed on the new nodes. That would 
spare reindexing everything.


But in my case, I add boxes after a noticeable performance degradation 
due to data volume increase. So the old boxes cannot afford reindexing 
data (or deleting if using the propose variation) in the new collection 
while serving searches with the old collection. Unless there is a way to 
bound aggressively the RAM consumption of new collection (disabling 
MMAP?), given that it's not being used for search during the transition? 
That said, even if that was possible, both collections would compete for 
disk IOs.


Thanks,
Damien

On 07/07/2014 12:26 PM, Shawn Heisey wrote:

On 7/7/2014 12:41 PM, Damien Dykman wrote:

I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes
and rebalance data accordingly.

Let's add the following constraints:
   - 1. boxes have different characteristics (RAM, CPU, disks)
   - 2. different number of shards per box/node (let's pretend we have
found the sweet spot for each box)
   - 3. once rebalancing is over, the layout of the cluster should be
the same as if it had been bootstrapped from N+M boxes

Because of the above constraints, shard splitting or moving shards
around is not an option. And to keep the discussion simple, let's
ignore shard replicas.

So far, the best scenario I could think of is the following:
   - a. 1 collection on the N nodes using implicit routing
   - b. add shards on the M new nodes as part of that collection
   - c. reindex a portion of the data on the shards of the M new nodes,
while restricting them from search
   - d. in 1 transaction, delete the old data and immediately issue a
soft commit and remove search restrictions

You may not like this answer, but here's a fairly clean way to do this,
assuming you have enough disk space on the existing machines:

1. Add the new boxes to the cluster.
2. Create a new collection across all the boxes.
2a. If your current collection is named "test" then name the new one
 "test0" or something else that's related, but different.
3. Index all data into the new collection.
4. As quickly as possible, do the following actions:
4a. Stop indexing.
4b. Do a synchronization pass on the new collection so it's current.
4c. Delete the original collection.
4d. Create a collection alias so that you can access the new collection
 with the original collection name.
4e. Restart indexing.


Thanks,
Shawn





Re: Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Shawn Heisey
On 7/7/2014 12:41 PM, Damien Dykman wrote:
> I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes
> and rebalance data accordingly.
>
> Let's add the following constraints:
>   - 1. boxes have different characteristics (RAM, CPU, disks)
>   - 2. different number of shards per box/node (let's pretend we have
> found the sweet spot for each box)
>   - 3. once rebalancing is over, the layout of the cluster should be
> the same as if it had been bootstrapped from N+M boxes
>
> Because of the above constraints, shard splitting or moving shards
> around is not an option. And to keep the discussion simple, let's
> ignore shard replicas.
>
> So far, the best scenario I could think of is the following:
>   - a. 1 collection on the N nodes using implicit routing
>   - b. add shards on the M new nodes as part of that collection
>   - c. reindex a portion of the data on the shards of the M new nodes,
> while restricting them from search
>   - d. in 1 transaction, delete the old data and immediately issue a
> soft commit and remove search restrictions

You may not like this answer, but here's a fairly clean way to do this,
assuming you have enough disk space on the existing machines:

1. Add the new boxes to the cluster.
2. Create a new collection across all the boxes.
2a. If your current collection is named "test" then name the new one
"test0" or something else that's related, but different.
3. Index all data into the new collection.
4. As quickly as possible, do the following actions:
4a. Stop indexing.
4b. Do a synchronization pass on the new collection so it's current.
4c. Delete the original collection.
4d. Create a collection alias so that you can access the new collection
with the original collection name.
4e. Restart indexing.
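
For step 4d, the alias is a single collections API call along these
lines (host is a placeholder; names follow the example above):

  http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=test&collections=test0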


Thanks,
Shawn



Transparently rebalancing a Solr cluster without splitting or moving shards

2014-07-07 Thread Damien Dykman
I have a cluster of N boxes/nodes and I'd like to add M boxes/nodes and 
rebalance data accordingly.


Let's add the following constraints:
  - 1. boxes have different characteristics (RAM, CPU, disks)
  - 2. different number of shards per box/node (let's pretend we have
found the sweet spot for each box)
  - 3. once rebalancing is over, the layout of the cluster should be
the same as if it had been bootstrapped from N+M boxes


Because of the above constraints, shard splitting or moving shards
around is not an option. And to keep the discussion simple, let's ignore
shard replicas.


So far, the best scenario I could think of is the following:
  - a. 1 collection on the N nodes using implicit routing
  - b. add shards on the M new nodes as part of that collection
  - c. reindex a portion of the data on the shards of the M new nodes, 
while restricting them from search
  - d. in 1 transaction, delete the old data and immediately issue a 
soft commit and remove search restrictions


Any better idea?

I could also use 1 collection per box and have Solr do the routing 
within each collection. I would still have to handle the routing across 
collections but collection aliases would come in handy. But overall, it 
would be similar to the above scenario. Actually in my case, it wouldn't 
work as well because I also use some kind of "flag document" on the M 
new nodes which I need to update atomically with the delete of the old 
stuff. And, if I'm not mistaken, I'd lose atomicity with the 
multi-collection scenario.


Thank you for your feedback,
Damien







Re: Long ParNew GC pauses - even when young generation is small

2014-07-07 Thread Shawn Heisey
On 7/7/2014 10:22 AM, aferdous wrote:
> Hi Shawn - I was just wondering how you resolved this issue in the end. We
> are seeing the same issue with our platform (similar heap size) and update
> volume.
>
> It would be nice if you could provide us with your final findings/configs.

I use the following GC tuning parameters now, with FAR better results:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
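
As a rough illustration of the kind of CMS tuning involved (example
values only, not necessarily the wiki's exact settings):

  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
  -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
  -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=8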

The G1 collector performed worse for me than untuned CMS.  The average
collection time was awesome, but when it finally did a stop-the-world
collection, those pauses were even longer.

Thanks,
Shawn



Re: Long ParNew GC pauses - even when young generation is small

2014-07-07 Thread aferdous
Hi Shawn - I was just wondering how you resolved this issue in the end. We
are seeing the same issue with our platform (similar heap size) and update
volume.

It would be nice if you could provide us with your final findings/configs.





Re: java.net.SocketException: Connection reset

2014-07-07 Thread Shawn Heisey
On 7/7/2014 7:30 AM, heaven wrote:
> Yeah, the heap is huge; need to optimize the caches. It was 8Gb previously;
> had to increase it because there were out-of-memory errors. Using
> ConcMarkSweepGC, which is not supposed to stop the world.

At one time the only thing I was using that was non-default for garbage
collection was turning on CMS.  I had VERY bad GC pauses.

I use the following settings now, and my pause problems are much
better.  There is very likely some other combination of settings that
would do even better with Solr:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

Thanks,
Shawn



Facets on Nested documents

2014-07-07 Thread adfel70
Hi,

I indexed different types (with different fields) of child docs for every parent.
I want to facet on a field in one type of child doc, and then facet on a
field in a different type of child doc. It doesn't work.

Any idea how I can do something like that?

Thanks.





Re: Need of hadoop

2014-07-07 Thread Erick Erickson
OK, _where_ is that written? The HdfsDirectoryFactory code?
Someone's blog somewhere? Your notes?

Ali has one part of the answer: using HDFS will redundantly
store your index, which is good.

Furthermore, the MapReduceIndexerTool (see the contribs)
_will_ use HDFS to do the classic M/R indexing process
for batch, and has a --go-live feature that allows you to merge
the results into a running SolrCloud. It was written with the
assumption that the Solr index was on HDFS.
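
A typical invocation looks something like this (the jar name, paths, and
hosts are placeholders; check the contrib documentation for the exact
flags):

  hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file morphline.conf \
    --output-dir hdfs://namenode/outdir \
    --zk-host zk1:2181/solr \
    --collection collection1 \
    --go-live \
    hdfs://namenode/indir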

FWIW,
Erick

On Mon, Jul 7, 2014 at 6:06 AM, Ali Nazemian  wrote:
> I think this will not improve the performance of indexing, but probably it
> would be a solution for high availability via HDFS's replication factor. But
> I am not sure about that.
>
>
> On Mon, Jul 7, 2014 at 12:53 PM, search engn dev 
> wrote:
>
>> Currently I am exploring Hadoop with Solr. Somewhere it is written that "This
>> does not use Hadoop Map-Reduce to process Solr data, rather it only uses the
>> HDFS filesystem for index and transaction log file storage."
>>
>> What, then, is the advantage of using Hadoop over a local file system?
>> Will the use of HDFS increase the overall performance of searching?
>>
>> Any detailed pointers regarding this will surely help me understand this.
>>
>>
>>
>>
>
>
>
> --
> A.Nazemian


Re: java.net.SocketException: Connection reset

2014-07-07 Thread santosh sidnal
Even I am facing the same issue. After doing a server restart, indexing
runs fine once, but the second time the same issue occurs.
On 3 Jul 2014 23:37, "heaven"  wrote:

> Hi, trying DigitalOcean for Solr, everything seems well, except sometimes I
> see these errors:
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at
>
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
> at
>
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
> at
>
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
> at
>
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
> at
>
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
> at
>
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
> at
>
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
> at
>
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
> at
>
> org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> 
>
> Solr version is 4.8.1, on Ubuntu Linux. We have 2 nodes: one runs 2 shards
> and the other 2 replicas.
>
> Errors happen during the indexing process. Does it require some
> tweaks/optimizations? I have no idea where to look to fix this. Any
> suggestions are welcome.
>
> Thank you,
> Alex
>
>
>
>


Re: java.net.SocketException: Connection reset

2014-07-07 Thread heaven
Yeah, the heap is huge; need to optimize the caches. It was 8Gb previously;
had to increase it because there were out-of-memory errors. Using
ConcMarkSweepGC, which is not supposed to stop the world.

Had to disable optimize (previously we did so by a cron task) because the
index is big and optimize has a bad impact on performance and resource usage.
We're using auto and soft commits only (the element names below were stripped
by the archive; reconstructed here with the standard autoCommit names):

<autoCommit>
  <maxDocs>25000</maxDocs>
  <maxTime>30</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1</maxTime>
</autoSoftCommit>

I was thinking we may reach some system limits, but netstat doesn't show
anything suspicious:
alex@solr1:~$ netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
137 CLOSE_WAIT
 24 ESTABLISHED
  9 LISTEN
 77 TIME_WAIT

It is also not clear where those errors happen. It would be useful for
users (I mean those not familiar with Solr development) if, instead of
printing the entire backtrace (or in addition to it), Solr logged
user-readable messages. Like: "Error while sending response to the client
#{client_ip:client_port}", or "when sending updates to replica
#{replica_ip:replica_port}". Because right now those errors are pretty confusing.

Best,
Alex





Re: Need of hadoop

2014-07-07 Thread Ali Nazemian
I think this will not improve the performance of indexing, but probably it
would be a solution for high availability via HDFS's replication factor. But
I am not sure about that.


On Mon, Jul 7, 2014 at 12:53 PM, search engn dev 
wrote:

> Currently I am exploring Hadoop with Solr. Somewhere it is written that "This
> does not use Hadoop Map-Reduce to process Solr data, rather it only uses the
> HDFS filesystem for index and transaction log file storage."
>
> What, then, is the advantage of using Hadoop over a local file system?
> Will the use of HDFS increase the overall performance of searching?
>
> Any detailed pointers regarding this will surely help me understand this.
>
>
>
>



-- 
A.Nazemian


Re: java.net.SocketException: Connection reset

2014-07-07 Thread Michael Della Bitta
I don't see anything out of the ordinary thus far, except your heap looks a
little big. I usually run with 6-7gb. I'm wondering if maybe you're running
into a juliet pause and that's causing your sockets to time out.

Have you gathered any GC stats?
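
If not, stock HotSpot GC logging is an easy way to start (the log path
is a placeholder):

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/solr/gc.log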

Also, what are you doing with respect to commits and optimizes?



Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions

w: appinions.com 


On Fri, Jul 4, 2014 at 5:22 PM, heaven  wrote:

> Today this had happened again + this one:
> null:java.net.SocketException: Broken pipe
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
> at
>
> org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:181)
> at
>
> org.apache.http.impl.io.ChunkedOutputStream.flushCache(ChunkedOutputStream.java:111)
> at
>
> org.apache.http.impl.io.ChunkedOutputStream.flush(ChunkedOutputStream.java:193)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:206)
> at
> org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
> at
> org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
> at
>
> org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> at
>
> org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
> at
>
> org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
> at
>
> org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
> at
>
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
> at
>
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
> at
>
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at
>
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at
>
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> Previously we had all 4 instances on a single node, so I thought these
> errors might be a result of high load. Like if some request was taking too
> long to complete, or something like that. And we always had missing docs in
> the index, or vice versa, some docs remained in the index when they
> shouldn't (even though it is supposed to recover from the log, and our
> index queue never removes docs from it until it gets a successful response
> from Solr).
>
> But now we run shards and replicas on separate nodes with lots of resources
> and very fast disk storage. And it still causes weird errors. It seems
> Solr is buggy as hell, that's my impression after a few years of usage. And
> it doesn't get better in this aspect, these errors follow us from the very
> beginning.
>
>
>
>


Re: solr dedup on specific fields

2014-07-07 Thread Ali Nazemian
Yeah, unfortunately I want it to be searchable:(



On Mon, Jul 7, 2014 at 2:23 PM, Alexandre Rafalovitch 
wrote:

> It's an interesting thought. I haven't tried those.
>
> But I don't think the EFFs are searchable. Do you need them to be
> searchable?
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Mon, Jul 7, 2014 at 4:48 PM, Ali Nazemian 
> wrote:
> > Dear Alexande,
> > What if I use ExternalFileFiled for the fields that I dont want to be
> > changed? Does that work for me?
> > Regards.
> >
> >
> > On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Well, let us know when you figure out a way to satisfy all your
> >> requirements.
> >>
> >> Solr is designed for a full-document replace to be efficient at it's
> >> primary function (search). Any workaround require some sort of
> >> sacrifice.
> >>
> >> Good luck,
> >>Alex.
> >> Personal website: http://www.outerthoughts.com/
> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >> proficiency
> >>
> >>
> >> On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian 
> >> wrote:
> >> > Updating documents will add some extra time to indexing process. (I
> send
> >> > the documents via apache Nutch) I prefer to make indexing as fast as
> >> > possible.
> >> >
> >> >
> >> > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >> > wrote:
> >> >
> >> >> Can you use Update operation instead of Create? Then, you can supply
> >> >> only the fields that need to be changed and use atomic update to
> >> >> preserve the others. But then you will have issues when you _are_
> >> >> creating new documents and you do need to store all fields.
> >> >>
> >> >> Regards,
> >> >>Alex.
> >> >> Personal website: http://www.outerthoughts.com/
> >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >> >> proficiency
> >> >>
> >> >>
> >> >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian 
> >> >> wrote:
> >> >> > Dears,
> >> >> > Is there any way that I can do that in other way?
> >> >> > I mean if you look at my main problem again you will find out that
> I
> >> have
> >> >> > two types of fields in my documents. 1) The ones that should be
> >> >> overwritten
> >> >> > on duplicates, 2) The ones that should not change during
> duplicates.
> >> So
> >> >> Is
> >> >> > it another way to handle this situation from the first place? I
> mean
> >> >> using
> >> >> > cross join for example?
> >> >> > Assume I have a document with ID 2 which contains all the fields
> that
> >> can
> >> >> > be overwritten. And another document with ID 2 which contains all
> >> fields
> >> >> > that should not change during duplication detection. For selecting
> all
> >> >> > fields it is enough to do join on ID and for Duplication it is
> enough
> >> to
> >> >> > overwrite just document type 1.
> >> >> > Regards.
> >> >> >
> >> >> >
> >> >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch <
> >> >> arafa...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst
> >> case,
> >> >> >> you can clone that code and add your preserve-field functionality.
> >> >> >> Could even be a nice contribution.
> >> >> >>
> >> >> >> Regards,
> >> >> >>Alex.
> >> >> >>
> >> >> >> Personal website: http://www.outerthoughts.com/
> >> >> >> Current project: http://www.solr-start.com/ - Accelerating your
> Solr
> >> >> >> proficiency
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian <
> alinazem...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Any suggestion would be appreciated.
> >> >> >> > Regards.
> >> >> >> >
> >> >> >> >
> >> >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian <
> >> alinazem...@gmail.com>
> >> >> >> wrote:
> >> >> >> >
> >> >> >> >> Hi,
> >> >> >> >> I used solr 4.8 for indexing the web pages that come from
> nutch. I
> >> >> know
> >> >> >> >> that solr deduplication operation works on uniquekey field. So
> I
> >> set
> >> >> >> that
> >> >> >> >> to URL field. Everything is OK. except that I want after
> >> duplication
> >> >> >> >> detection solr try not to delete all fields of old document. I
> >> want
> >> >> some
> >> >> >> >> fields remain unchanged. For example assume I have a data field
> >> >> called
> >> >> >> >> "read" with Boolean value "true" for specific document. I want
> all
> >> >> >> fields
> >> >> >> >> of new document overwrites except the value of this field. Is
> that
> >> >> >> >> possible? How?
> >> >> >> >> Regards.
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> A.Nazemian
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> > A.Nazemian
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > A.Nazemian
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > A.Nazemian
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian


Re: solr dedup on specific fields

2014-07-07 Thread Alexandre Rafalovitch
It's an interesting thought. I haven't tried those.

But I don't think the EFFs are searchable. Do you need them to be searchable?

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Jul 7, 2014 at 4:48 PM, Ali Nazemian  wrote:
> Dear Alexande,
> What if I use ExternalFileFiled for the fields that I dont want to be
> changed? Does that work for me?
> Regards.
>
>
> On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch 
> wrote:
>
>> Well, let us know when you figure out a way to satisfy all your
>> requirements.
>>
>> Solr is designed for a full-document replace to be efficient at it's
>> primary function (search). Any workaround require some sort of
>> sacrifice.
>>
>> Good luck,
>>Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian 
>> wrote:
>> > Updating documents will add some extra time to indexing process. (I send
>> > the documents via apache Nutch) I prefer to make indexing as fast as
>> > possible.
>> >
>> >
>> > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> > wrote:
>> >
>> >> Can you use Update operation instead of Create? Then, you can supply
>> >> only the fields that need to be changed and use atomic update to
>> >> preserve the others. But then you will have issues when you _are_
>> >> creating new documents and you do need to store all fields.
>> >>
>> >> Regards,
>> >>Alex.
>> >> Personal website: http://www.outerthoughts.com/
>> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> >> proficiency
>> >>
>> >>
>> >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian 
>> >> wrote:
>> >> > Dears,
>> >> > Is there any way that I can do that in other way?
>> >> > I mean if you look at my main problem again you will find out that I
>> have
>> >> > two types of fields in my documents. 1) The ones that should be
>> >> overwritten
>> >> > on duplicates, 2) The ones that should not change during duplicates.
>> So
>> >> Is
>> >> > it another way to handle this situation from the first place? I mean
>> >> using
>> >> > cross join for example?
>> >> > Assume I have a document with ID 2 which contains all the fields that
>> can
>> >> > be overwritten. And another document with ID 2 which contains all
>> fields
>> >> > that should not change during duplication detection. For selecting all
>> >> > fields it is enough to do join on ID and for Duplication it is enough
>> to
>> >> > overwrite just document type 1.
>> >> > Regards.
>> >> >
>> >> >
>> >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch <
>> >> arafa...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst
>> case,
>> >> >> you can clone that code and add your preserve-field functionality.
>> >> >> Could even be a nice contribution.
>> >> >>
>> >> >> Regards,
>> >> >>Alex.
>> >> >>
>> >> >> Personal website: http://www.outerthoughts.com/
>> >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> >> >> proficiency
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
>> >> >> wrote:
>> >> >> > Any suggestion would be appreciated.
>> >> >> > Regards.
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian <
>> alinazem...@gmail.com>
>> >> >> wrote:
>> >> >> >
>> >> >> >> Hi,
>> >> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I
>> >> know
>> >> >> >> that solr deduplication operation works on uniquekey field. So I
>> set
>> >> >> that
>> >> >> >> to URL field. Everything is OK. except that I want after
>> duplication
>> >> >> >> detection solr try not to delete all fields of old document. I
>> want
>> >> some
>> >> >> >> fields remain unchanged. For example assume I have a data field
>> >> called
>> >> >> >> "read" with Boolean value "true" for specific document. I want all
>> >> >> fields
>> >> >> >> of new document overwrites except the value of this field. Is that
>> >> >> >> possible? How?
>> >> >> >> Regards.
>> >> >> >>
>> >> >> >> --
>> >> >> >> A.Nazemian
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > A.Nazemian
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > A.Nazemian
>> >>
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>>
>
>
>
> --
> A.Nazemian


Re: solr dedup on specific fields

2014-07-07 Thread Ali Nazemian
Dear Alexandre,
What if I use ExternalFileField for the fields that I don't want to be
changed? Would that work for me?
Regards.


On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch 
wrote:

> Well, let us know when you figure out a way to satisfy all your
> requirements.
>
> Solr is designed for a full-document replace to be efficient at it's
> primary function (search). Any workaround require some sort of
> sacrifice.
>
> Good luck,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian 
> wrote:
> > Updating documents will add some extra time to indexing process. (I send
> > the documents via apache Nutch) I prefer to make indexing as fast as
> > possible.
> >
> >
> > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Can you use Update operation instead of Create? Then, you can supply
> >> only the fields that need to be changed and use atomic update to
> >> preserve the others. But then you will have issues when you _are_
> >> creating new documents and you do need to store all fields.
> >>
> >> Regards,
> >>Alex.
> >> Personal website: http://www.outerthoughts.com/
> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >> proficiency
> >>
> >>
> >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian 
> >> wrote:
> >> > Dears,
> >> > Is there any way that I can do that in other way?
> >> > I mean if you look at my main problem again you will find out that I
> have
> >> > two types of fields in my documents. 1) The ones that should be
> >> overwritten
> >> > on duplicates, 2) The ones that should not change during duplicates.
> So
> >> Is
> >> > it another way to handle this situation from the first place? I mean
> >> using
> >> > cross join for example?
> >> > Assume I have a document with ID 2 which contains all the fields that
> can
> >> > be overwritten. And another document with ID 2 which contains all
> fields
> >> > that should not change during duplication detection. For selecting all
> >> > fields it is enough to do join on ID and for Duplication it is enough
> to
> >> > overwrite just document type 1.
> >> > Regards.
> >> >
> >> >
> >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch <
> >> arafa...@gmail.com>
> >> > wrote:
> >> >
> >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst
> case,
> >> >> you can clone that code and add your preserve-field functionality.
> >> >> Could even be a nice contribution.
> >> >>
> >> >> Regards,
> >> >>Alex.
> >> >>
> >> >> Personal website: http://www.outerthoughts.com/
> >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >> >> proficiency
> >> >>
> >> >>
> >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
> >> >> wrote:
> >> >> > Any suggestion would be appreciated.
> >> >> > Regards.
> >> >> >
> >> >> >
> >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian <
> alinazem...@gmail.com>
> >> >> wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I
> >> know
> >> >> >> that solr deduplication operation works on uniquekey field. So I
> set
> >> >> that
> >> >> >> to URL field. Everything is OK. except that I want after
> duplication
> >> >> >> detection solr try not to delete all fields of old document. I
> want
> >> some
> >> >> >> fields remain unchanged. For example assume I have a data field
> >> called
> >> >> >> "read" with Boolean value "true" for specific document. I want all
> >> >> fields
> >> >> >> of new document overwrites except the value of this field. Is that
> >> >> >> possible? How?
> >> >> >> Regards.
> >> >> >>
> >> >> >> --
> >> >> >> A.Nazemian
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > A.Nazemian
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > A.Nazemian
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian


Re: solr dedup on specific fields

2014-07-07 Thread Alexandre Rafalovitch
Well, let us know when you figure out a way to satisfy all your requirements.

Solr is designed for a full-document replace to be efficient at its
primary function (search). Any workaround requires some sort of
sacrifice.

Good luck,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian  wrote:
> Updating documents will add some extra time to indexing process. (I send
> the documents via apache Nutch) I prefer to make indexing as fast as
> possible.
>
>
> On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch 
> wrote:
>
>> Can you use Update operation instead of Create? Then, you can supply
>> only the fields that need to be changed and use atomic update to
>> preserve the others. But then you will have issues when you _are_
>> creating new documents and you do need to store all fields.
>>
>> Regards,
>>Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian 
>> wrote:
>> > Dears,
>> > Is there any way that I can do that in other way?
>> > I mean if you look at my main problem again you will find out that I have
>> > two types of fields in my documents. 1) The ones that should be
>> overwritten
>> > on duplicates, 2) The ones that should not change during duplicates. So
>> Is
>> > it another way to handle this situation from the first place? I mean
>> using
>> > cross join for example?
>> > Assume I have a document with ID 2 which contains all the fields that can
>> > be overwritten. And another document with ID 2 which contains all fields
>> > that should not change during duplication detection. For selecting all
>> > fields it is enough to do join on ID and for Duplication it is enough to
>> > overwrite just document type 1.
>> > Regards.
>> >
>> >
>> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> > wrote:
>> >
>> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case,
>> >> you can clone that code and add your preserve-field functionality.
>> >> Could even be a nice contribution.
>> >>
>> >> Regards,
>> >>Alex.
>> >>
>> >> Personal website: http://www.outerthoughts.com/
>> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> >> proficiency
>> >>
>> >>
>> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
>> >> wrote:
>> >> > Any suggestion would be appreciated.
>> >> > Regards.
>> >> >
>> >> >
>> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian 
>> >> wrote:
>> >> >
>> >> >> Hi,
>> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I
>> know
>> >> >> that solr deduplication operation works on uniquekey field. So I set
>> >> that
>> >> >> to URL field. Everything is OK. except that I want after duplication
>> >> >> detection solr try not to delete all fields of old document. I want
>> some
>> >> >> fields remain unchanged. For example assume I have a data field
>> called
>> >> >> "read" with Boolean value "true" for specific document. I want all
>> >> fields
>> >> >> of new document overwrites except the value of this field. Is that
>> >> >> possible? How?
>> >> >> Regards.
>> >> >>
>> >> >> --
>> >> >> A.Nazemian
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > A.Nazemian
>> >>
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>>
>
>
>
> --
> A.Nazemian


Re: solr dedup on specific fields

2014-07-07 Thread Ali Nazemian
Updating documents will add some extra time to the indexing process (I send
the documents via Apache Nutch), and I prefer to make indexing as fast as
possible.


On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch 
wrote:

> Can you use Update operation instead of Create? Then, you can supply
> only the fields that need to be changed and use atomic update to
> preserve the others. But then you will have issues when you _are_
> creating new documents and you do need to store all fields.
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian 
> wrote:
> > Dears,
> > Is there any way that I can do that in other way?
> > I mean if you look at my main problem again you will find out that I have
> > two types of fields in my documents. 1) The ones that should be
> overwritten
> > on duplicates, 2) The ones that should not change during duplicates. So
> Is
> > it another way to handle this situation from the first place? I mean
> using
> > cross join for example?
> > Assume I have a document with ID 2 which contains all the fields that can
> > be overwritten. And another document with ID 2 which contains all fields
> > that should not change during duplication detection. For selecting all
> > fields it is enough to do join on ID and for Duplication it is enough to
> > overwrite just document type 1.
> > Regards.
> >
> >
> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case,
> >> you can clone that code and add your preserve-field functionality.
> >> Could even be a nice contribution.
> >>
> >> Regards,
> >>Alex.
> >>
> >> Personal website: http://www.outerthoughts.com/
> >> Current project: http://www.solr-start.com/ - Accelerating your Solr
> >> proficiency
> >>
> >>
> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
> >> wrote:
> >> > Any suggestion would be appreciated.
> >> > Regards.
> >> >
> >> >
> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian 
> >> wrote:
> >> >
> >> >> Hi,
> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I
> know
> >> >> that solr deduplication operation works on uniquekey field. So I set
> >> that
> >> >> to URL field. Everything is OK. except that I want after duplication
> >> >> detection solr try not to delete all fields of old document. I want
> some
> >> >> fields remain unchanged. For example assume I have a data field
> called
> >> >> "read" with Boolean value "true" for specific document. I want all
> >> fields
> >> >> of new document overwrites except the value of this field. Is that
> >> >> possible? How?
> >> >> Regards.
> >> >>
> >> >> --
> >> >> A.Nazemian
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > A.Nazemian
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian


Re: Language detection for solr 3.6.1

2014-07-07 Thread Poornima Jay
Hi,

Please let me know if anyone has used Google language detection for 
implementing multilanguage search in one schema.

Thanks,
Poornima




On Tuesday, 1 July 2014 6:54 PM, Poornima Jay  
wrote:
 


Hi,

Can anyone please let me know how to integrate 
http://code.google.com/p/language-detection/ in Solr 3.6.1? I want these 
languages (English, Chinese Simplified, Chinese Traditional, Japanese, and 
Korean) added in one schema, i.e. multilingual search from a single schema 
file.

I tried adding solr-langdetect-3.5.0.jar in my /solr/contrib/langid/lib/ 
location and in /webapps/solr/WEB-INF/contrib/langid/lib/, and made changes to 
the solrconfig.xml as below:



 
    
    
<!-- The XML element names in this snippet were stripped by the archive;
     they are reconstructed here using the standard langid parameter names,
     inferred from the surviving values. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">content_eng</str>
    <bool name="langid.map">true</bool>
    <str name="langid.map.fl">content_eng,content_ja</str>
    <str name="langid.whitelist">en,ja</str>
    <str name="langid.map.lcmap">en:english ja:japanese</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

Please suggest a solution.

Thanks,
Poornima

Need of hadoop

2014-07-07 Thread search engn dev
Currently I am exploring Hadoop with Solr. Somewhere it is written that "This
does not use Hadoop Map-Reduce to process Solr data, rather it only uses the
HDFS filesystem for index and transaction log file storage."

What, then, is the advantage of using Hadoop over a local file system?
Will the use of HDFS increase the overall performance of searching?

Any detailed pointers regarding this will surely help me understand this.





Re: How to index data by hadoop and solr?

2014-07-07 Thread wanggaohang

Use map-reduce to index into Solr (like the solrindex job in Nutch).
On 2014-07-07 11:55, toothlou_t...@163.com wrote:

 Hello:
 I want to use Hadoop and Solr to index data. Can someone 
tell me how to do it?


toothlou_t...@163.com




Re: solr dedup on specific fields

2014-07-07 Thread Alexandre Rafalovitch
Can you use an Update operation instead of Create? Then you can supply
only the fields that need to be changed and use atomic update to
preserve the others. But then you will have issues when you _are_
creating new documents and do need to store all fields.
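
As a minimal sketch, an atomic update that changes one field and leaves
the rest intact could look like this (the id and field names are
illustrative, assuming Solr 4.x JSON updates and that all fields are
stored):

  curl 'http://localhost:8983/solr/update/json?commit=true' \
    -H 'Content-type:application/json' \
    -d '[{"id": "http://example.com/page", "read": {"set": true}}]'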

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian  wrote:
> Dears,
> Is there any way that I can do that in other way?
> I mean if you look at my main problem again you will find out that I have
> two types of fields in my documents. 1) The ones that should be overwritten
> on duplicates, 2) The ones that should not change during duplicates. So Is
> it another way to handle this situation from the first place? I mean using
> cross join for example?
> Assume I have a document with ID 2 which contains all the fields that can
> be overwritten. And another document with ID 2 which contains all fields
> that should not change during duplication detection. For selecting all
> fields it is enough to do join on ID and for Duplication it is enough to
> overwrite just document type 1.
> Regards.
>
>
> On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch 
> wrote:
>
>> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case,
>> you can clone that code and add your preserve-field functionality.
>> Could even be a nice contribution.
>>
>> Regards,
>>Alex.
>>
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
>> wrote:
>> > Any suggestion would be appreciated.
>> > Regards.
>> >
>> >
>> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian 
>> wrote:
>> >
>> >> Hi,
>> >> I used solr 4.8 for indexing the web pages that come from nutch. I know
>> >> that solr deduplication operation works on uniquekey field. So I set
>> that
>> >> to URL field. Everything is OK. except that I want after duplication
>> >> detection solr try not to delete all fields of old document. I want some
>> >> fields remain unchanged. For example assume I have a data field called
>> >> "read" with Boolean value "true" for specific document. I want all
>> fields
>> >> of new document overwrites except the value of this field. Is that
>> >> possible? How?
>> >> Regards.
>> >>
>> >> --
>> >> A.Nazemian
>> >>
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>>
>
>
>
> --
> A.Nazemian


Re: solr dedup on specific fields

2014-07-07 Thread Ali Nazemian
Dears,
Is there any way that I can do this differently?
If you look at my main problem again, you will find that I have two types
of fields in my documents: 1) the ones that should be overwritten on
duplicates, and 2) the ones that should not change on duplicates. So is
there another way to handle this situation from the start? Using a cross
join, for example?
Assume I have a document with ID 2 which contains all the fields that can
be overwritten, and another document with ID 2 which contains all the fields
that should not change during duplication detection. For selecting all
fields it is enough to join on ID, and for deduplication it is enough to
overwrite just document type 1.
Regards.


On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch 
wrote:

> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case,
> you can clone that code and add your preserve-field functionality.
> Could even be a nice contribution.
>
> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian 
> wrote:
> > Any suggestion would be appreciated.
> > Regards.
> >
> >
> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian 
> wrote:
> >
> >> Hi,
> >> I used solr 4.8 for indexing the web pages that come from nutch. I know
> >> that solr deduplication operation works on uniquekey field. So I set
> that
> >> to URL field. Everything is OK. except that I want after duplication
> >> detection solr try not to delete all fields of old document. I want some
> >> fields remain unchanged. For example assume I have a data field called
> >> "read" with Boolean value "true" for specific document. I want all
> fields
> >> of new document overwrites except the value of this field. Is that
> >> possible? How?
> >> Regards.
> >>
> >> --
> >> A.Nazemian
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian